{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**基于Jupyter的特征工程笔记本2: 特征选择**  \n",
    "*作者: 陈颖祥，杨子唅*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "toc": true
   },
   "source": [
    "<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n",
    "<div class=\"toc\"><ul class=\"toc-item\"><li><span><a href=\"#Feature-Selection-特征选择\" data-toc-modified-id=\"Feature-Selection-特征选择-1\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>Feature Selection 特征选择</a></span><ul class=\"toc-item\"><li><span><a href=\"#Filter-Methods-过滤法\" data-toc-modified-id=\"Filter-Methods-过滤法-1.1\"><span class=\"toc-item-num\">1.1&nbsp;&nbsp;</span>Filter Methods 过滤法</a></span><ul class=\"toc-item\"><li><span><a href=\"#Univariate-Filter-Methods-单变量特征过滤\" data-toc-modified-id=\"Univariate-Filter-Methods-单变量特征过滤-1.1.1\"><span class=\"toc-item-num\">1.1.1&nbsp;&nbsp;</span>Univariate Filter Methods 单变量特征过滤</a></span><ul class=\"toc-item\"><li><span><a href=\"#Variance-Threshold-方差选择法\" data-toc-modified-id=\"Variance-Threshold-方差选择法-1.1.1.1\"><span class=\"toc-item-num\">1.1.1.1&nbsp;&nbsp;</span>Variance Threshold 方差选择法</a></span></li><li><span><a href=\"#Pearson-Correlation-(regression-problem)--皮尔森相关系数-(回归问题)\" data-toc-modified-id=\"Pearson-Correlation-(regression-problem)--皮尔森相关系数-(回归问题)-1.1.1.2\"><span class=\"toc-item-num\">1.1.1.2&nbsp;&nbsp;</span>Pearson Correlation (regression problem)  皮尔森相关系数 (回归问题)</a></span></li><li><span><a href=\"#Distance-Correlation-(regression-problem)-距离相关系数-(回归问题)\" data-toc-modified-id=\"Distance-Correlation-(regression-problem)-距离相关系数-(回归问题)-1.1.1.3\"><span class=\"toc-item-num\">1.1.1.3&nbsp;&nbsp;</span>Distance Correlation (regression problem) 距离相关系数 (回归问题)</a></span></li><li><span><a href=\"#F-Score-(regression-problem)-F-统计量-(回归问题)\" data-toc-modified-id=\"F-Score-(regression-problem)-F-统计量-(回归问题)-1.1.1.4\"><span class=\"toc-item-num\">1.1.1.4&nbsp;&nbsp;</span>F-Score (regression problem) F-统计量 (回归问题)</a></span></li><li><span><a href=\"#Mutual-Information-(regression-problem)-互信息-(回归问题)\" data-toc-modified-id=\"Mutual-Information-(regression-problem)-互信息-(回归问题)-1.1.1.5\"><span class=\"toc-item-num\">1.1.1.5&nbsp;&nbsp;</span>Mutual Information (regression problem) 互信息 (回归问题)</a></span></li><li><span><a href=\"#Chi-squared-Statistics-(classification-problem)-卡方统计量-(分类问题)\" data-toc-modified-id=\"Chi-squared-Statistics-(classification-problem)-卡方统计量-(分类问题)-1.1.1.6\"><span class=\"toc-item-num\">1.1.1.6&nbsp;&nbsp;</span>Chi-squared Statistics (classification problem) 卡方统计量 (分类问题)</a></span></li><li><span><a href=\"#F-Score-(classification-problem)-F-统计量-(分类问题)\" data-toc-modified-id=\"F-Score-(classification-problem)-F-统计量-(分类问题)-1.1.1.7\"><span class=\"toc-item-num\">1.1.1.7&nbsp;&nbsp;</span>F-Score (classification problem) F-统计量 (分类问题)</a></span></li><li><span><a href=\"#Mutual-Information-(classification-problem)-互信息-(分类问题)\" data-toc-modified-id=\"Mutual-Information-(classification-problem)-互信息-(分类问题)-1.1.1.8\"><span class=\"toc-item-num\">1.1.1.8&nbsp;&nbsp;</span>Mutual Information (classification problem) 互信息 (分类问题)</a></span></li></ul></li><li><span><a href=\"#Multivariate-Filter-Methods-多元特征过滤\" data-toc-modified-id=\"Multivariate-Filter-Methods-多元特征过滤-1.1.2\"><span class=\"toc-item-num\">1.1.2&nbsp;&nbsp;</span>Multivariate Filter Methods 多元特征过滤</a></span><ul class=\"toc-item\"><li><span><a href=\"#Max-Relevance-Min-Redundancy-(mRMR)-最大相关最小冗余\" data-toc-modified-id=\"Max-Relevance-Min-Redundancy-(mRMR)-最大相关最小冗余-1.1.2.1\"><span class=\"toc-item-num\">1.1.2.1&nbsp;&nbsp;</span>Max-Relevance Min-Redundancy (mRMR) 最大相关最小冗余</a></span></li><li><span><a href=\"#Correlation-based-Feature-Selection-(CFS)-基于相关性的特征选择\" data-toc-modified-id=\"Correlation-based-Feature-Selection-(CFS)-基于相关性的特征选择-1.1.2.2\"><span class=\"toc-item-num\">1.1.2.2&nbsp;&nbsp;</span>Correlation-based Feature Selection (CFS) 基于相关性的特征选择</a></span></li><li><span><a href=\"#Fast-Correlation-based-Filter-(FCBF)-基于相关性的快速特征选择\" data-toc-modified-id=\"Fast-Correlation-based-Filter-(FCBF)-基于相关性的快速特征选择-1.1.2.3\"><span class=\"toc-item-num\">1.1.2.3&nbsp;&nbsp;</span>Fast Correlation-based Filter (FCBF) 基于相关性的快速特征选择</a></span></li><li><span><a href=\"#ReliefF\" data-toc-modified-id=\"ReliefF-1.1.2.4\"><span class=\"toc-item-num\">1.1.2.4&nbsp;&nbsp;</span>ReliefF</a></span></li><li><span><a href=\"#Spectral-Feature-Selection-(SPEC)-基于谱图的特征选择\" data-toc-modified-id=\"Spectral-Feature-Selection-(SPEC)-基于谱图的特征选择-1.1.2.5\"><span class=\"toc-item-num\">1.1.2.5&nbsp;&nbsp;</span>Spectral Feature Selection (SPEC) 基于谱图的特征选择</a></span></li></ul></li></ul></li><li><span><a href=\"#Wrapper-Methods-封装方法\" data-toc-modified-id=\"Wrapper-Methods-封装方法-1.2\"><span class=\"toc-item-num\">1.2&nbsp;&nbsp;</span>Wrapper Methods 封装方法</a></span><ul class=\"toc-item\"><li><span><a href=\"#Deterministic-Algorithms-确定性算法\" data-toc-modified-id=\"Deterministic-Algorithms-确定性算法-1.2.1\"><span class=\"toc-item-num\">1.2.1&nbsp;&nbsp;</span>Deterministic Algorithms 确定性算法</a></span><ul class=\"toc-item\"><li><span><a href=\"#Recursive-Feature-Elimination-(SBS)-递归式特征消除\" data-toc-modified-id=\"Recursive-Feature-Elimination-(SBS)-递归式特征消除-1.2.1.1\"><span class=\"toc-item-num\">1.2.1.1&nbsp;&nbsp;</span>Recursive Feature Elimination (SBS) 递归式特征消除</a></span></li></ul></li><li><span><a href=\"#Randomized-Algorithms-随机方法\" data-toc-modified-id=\"Randomized-Algorithms-随机方法-1.2.2\"><span class=\"toc-item-num\">1.2.2&nbsp;&nbsp;</span>Randomized Algorithms 随机方法</a></span><ul class=\"toc-item\"><li><span><a href=\"#Simulated-Annealing-(SA)-基于模拟退火特征选择\" data-toc-modified-id=\"Simulated-Annealing-(SA)-基于模拟退火特征选择-1.2.2.1\"><span class=\"toc-item-num\">1.2.2.1&nbsp;&nbsp;</span>Simulated Annealing (SA) 基于模拟退火特征选择</a></span></li><li><span><a href=\"#Genetic-Algorithm-(GA)-基于基因算法特征选择\" data-toc-modified-id=\"Genetic-Algorithm-(GA)-基于基因算法特征选择-1.2.2.2\"><span class=\"toc-item-num\">1.2.2.2&nbsp;&nbsp;</span>Genetic Algorithm (GA) 基于基因算法特征选择</a></span></li></ul></li></ul></li><li><span><a href=\"#Embedded-Methods-嵌入方法\" data-toc-modified-id=\"Embedded-Methods-嵌入方法-1.3\"><span class=\"toc-item-num\">1.3&nbsp;&nbsp;</span>Embedded Methods 嵌入方法</a></span><ul class=\"toc-item\"><li><span><a href=\"#基于正则化模型的方法\" data-toc-modified-id=\"基于正则化模型的方法-1.3.1\"><span class=\"toc-item-num\">1.3.1&nbsp;&nbsp;</span>基于正则化模型的方法</a></span><ul class=\"toc-item\"><li><span><a href=\"#Lasso-Regression-(Linear-Regression-with-L1-Norm)-套索回归\" data-toc-modified-id=\"Lasso-Regression-(Linear-Regression-with-L1-Norm)-套索回归-1.3.1.1\"><span class=\"toc-item-num\">1.3.1.1&nbsp;&nbsp;</span>Lasso Regression (Linear Regression with L1 Norm) 套索回归</a></span></li><li><span><a href=\"#Logistic-Regression-(with-L1-Norm)-逻辑回归\" data-toc-modified-id=\"Logistic-Regression-(with-L1-Norm)-逻辑回归-1.3.1.2\"><span class=\"toc-item-num\">1.3.1.2&nbsp;&nbsp;</span>Logistic Regression (with L1 Norm) 逻辑回归</a></span></li><li><span><a href=\"#LinearSVR/-LinearSVC-线性向量支持机\" data-toc-modified-id=\"LinearSVR/-LinearSVC-线性向量支持机-1.3.1.3\"><span class=\"toc-item-num\">1.3.1.3&nbsp;&nbsp;</span>LinearSVR/ LinearSVC 线性向量支持机</a></span></li></ul></li><li><span><a href=\"#Tree-Based-Methods-基于树模型的方法\" data-toc-modified-id=\"Tree-Based-Methods-基于树模型的方法-1.3.2\"><span class=\"toc-item-num\">1.3.2&nbsp;&nbsp;</span>Tree Based Methods 基于树模型的方法</a></span></li></ul></li></ul></li></ul></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Reference**\n",
    "- https://scikit-learn.org/stable/modules/feature_selection.html\n",
    "- https://machinelearningmastery.com/an-introduction-to-feature-selection/\n",
    "- https://dcor.readthedocs.io/en/latest/functions/dcor.independence.distance_covariance_test.html\n",
    "- https://en.wikipedia.org/wiki/Distance_correlation\n",
    "- https://stats.stackexchange.com/questions/56881/whats-the-relationship-between-r2-and-f-test\n",
    "- https://en.wikipedia.org/wiki/Mutual_information\n",
    "- https://libguides.library.kent.edu/SPSS/ChiSquare\n",
    "- https://chrisalbon.com/machine_learning/feature_selection/anova_f-value_for_feature_selection/\n",
    "- https://online.stat.psu.edu/stat414/node/218/\n",
    "- http://featureselection.asu.edu/algorithms.php\n",
    "- https://en.wikipedia.org/wiki/Minimum_redundancy_feature_selection\n",
    "- Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis & Machine Intelligence, (8), 1226-1238.\n",
    "- Hall, M. A., & Smith, L. A. (1999, May). Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper. In FLAIRS conference (Vol. 1999, pp. 235-239).\n",
    "- Yu, L., & Liu, H. (2003). Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proceedings of the 20th international conference on machine learning (ICML-03) (pp. 856-863).\n",
    "- Zhao, Z., & Liu, H. (2007, June). Spectral feature selection for supervised and unsupervised learning. In Proceedings of the 24th international conference on Machine learning (pp. 1151-1157). ACM.\n",
    "- Robnik-Šikonja, M., & Kononenko, I. (2003). Theoretical and empirical analysis of ReliefF and RReliefF. Machine learning, 53(1-2), 23-69.\n",
    "- https://machinelearningmastery.com/an-introduction-to-feature-selection/\n",
    "- http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/\n",
    "- https://cloud.tencent.com/developer/article/1087796"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Selection 特征选择"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "数据预处理后，我们生成了大量的新变量（比如独热编码生成了大量仅包含0或1的变量）。但实际上，部分新生成的变量可能是多余：一方面它们本身不一定包含有用的信息，故无法提高模型性能；另一方面过这些多余变量在构建模型时会消耗大量内存和计算能力。因此，我们应该进行特征选择并选择特征子集进行建模。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filter Methods 过滤法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "过滤法通过使用一些统计量或假设检验结果为每个变量打分。得分较高的功能往往更加重要，因此应被包含在子集中。以下为一个简单的基于过滤法的机器学习工作流（以最简单的训练-验证-测试这种数据集划分方法为例）。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![image](../images/过滤法工作流.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Univariate Filter Methods 单变量特征过滤"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "单变量过滤方法依据单变量统计量或统计检验选择最佳特征。其仅仅考虑单个变量与目标变量的关系（方差选择法仅基于单个变量）。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Variance Threshold 方差选择法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "方差选择法删除变量方差低于某个阈值的所有特征。例如，我们应删除方差为零的特征（所有观测点中具有相同值的特征），因为该特征无法解释目标变量的任何变化。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:26:35.507588Z",
     "start_time": "2020-03-30T03:26:34.556950Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.feature_selection import VarianceThreshold\n",
    "\n",
    "# 合成一些数据集用于演示\n",
    "train_set = np.array([[1,2,3],[1,4,7],[1,4,9]]) # 可见第一个变量方差为0\n",
    "# array([[1, 2, 3],\n",
    "#        [1, 4, 7],\n",
    "#        [1, 4, 9]])\n",
    "\n",
    "test_set = np.array([[3,2,3],[1,2,7]]) # 故意将第二个变量方差设为0\n",
    "# array([[3, 2, 3],\n",
    "#        [1, 2, 7]])\n",
    "\n",
    "selector = VarianceThreshold()\n",
    "selector.fit(train_set) # 在训练集上训练\n",
    "transformed_train = selector.transform(train_set) # 转换训练集\n",
    "# 下面为返回结果，可见第一个变量已被删除\n",
    "# array([[2, 3],\n",
    "#        [4, 7],\n",
    "#        [4, 9]])\n",
    "\n",
    "transformed_test = selector.transform(test_set) # 转换测试集\n",
    "# 下面为返回结果，可见第一个变量已被删除\n",
    "# array([[2, 3],\n",
    "#        [2, 7]])\n",
    "# 虽然测试集中第二个变量的方差也为0\n",
    "# 但是我们的选择是基于训练集，所以我们依然删除第一个变量"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Pearson Correlation (regression problem)  皮尔森相关系数 (回归问题)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "皮尔森相关系数一般用于衡量两个**连续**变量之间的线性相关性，也可以用于衡量二元变量与目标变量的相关性。故可以将类别变量利用独热编码转换为多个二元变量之后利用皮尔森相关系数进行筛选。\n",
    "\n",
    "公式：\n",
    "$r = \\frac{\\sum_{i=1}^{n}(X_i-\\bar{X})(Y_i-\\bar{Y})}{\\sqrt{(X_i-\\bar{X})^2}\\sqrt{(Y_i-\\bar{Y})^2}}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:26:35.549368Z",
     "start_time": "2020-03-30T03:26:35.509833Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from scipy.stats import pearsonr\n",
    "from sklearn.feature_selection import SelectKBest\n",
    "\n",
    "# 直接载入数据集\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "dataset = fetch_california_housing()\n",
    "X, y = dataset.data, dataset.target # 利用 california_housing 数据集来演示\n",
    "# 此数据集中，X，y均为连续变量，故此满足使用皮尔森相关系数的条件\n",
    "\n",
    "# 选择前15000个观测点作为训练集\n",
    "# 剩下的作为测试集\n",
    "train_set = X[0:15000,:]\n",
    "test_set = X[15000:,]\n",
    "train_y = y[0:15000]\n",
    "\n",
    "# sklearn 中没有直接的函数可以使用\n",
    "# 此处将用 scipy.stats.pearsonr函数来实现基于皮尔森相关系数的特征过滤\n",
    "# 注意 scipy.stats.pearsonr 计算的是两个变量之间的相关系数\n",
    "# 因sklearn SelectKBest需要，我们将基于scipy.stats.pearsonr 重写允许多特征同时输入的函数 udf_pearsonr\n",
    "\n",
    "def udf_pearsonr(X, y):\n",
    "    # 将会分别计算每一个变量与目标变量的关系\n",
    "    result = np.array([pearsonr(x, y) for x in X.T]) # 包含(皮尔森相关系数, p值) 的列表\n",
    "    return np.absolute(result[:,0]), result[:,1] \n",
    "\n",
    "# SelectKBest 将会基于一个判别函数自动选择得分高的变量\n",
    "# 这里的判别函数为皮尔森相关系数\n",
    "selector = SelectKBest(udf_pearsonr, k=2) # k => 我们想要选择的变量数\n",
    "selector.fit(train_set, train_y) # 在训练集上训练\n",
    "transformed_train = selector.transform(train_set) # 转换训练集\n",
    "transformed_train.shape #(15000, 2), 其选择了第一个及第七个变量 \n",
    "assert np.array_equal(transformed_train, train_set[:,[0,6]]) \n",
    "\n",
    "transformed_test = selector.transform(test_set) # 转换测试集\n",
    "assert np.array_equal(transformed_test, test_set[:,[0,6]]);\n",
    "# 可见对于测试集，其依然选择了第一个及第七个变量 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:26:35.560082Z",
     "start_time": "2020-03-30T03:26:35.551499Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "第1个变量和目标的皮尔森相关系数的绝对值为0.7, p-值为0.0\n",
      "第2个变量和目标的皮尔森相关系数的绝对值为0.07, p-值为0.0\n",
      "第3个变量和目标的皮尔森相关系数的绝对值为0.14, p-值为0.0\n",
      "第4个变量和目标的皮尔森相关系数的绝对值为0.04, p-值为0.0\n",
      "第5个变量和目标的皮尔森相关系数的绝对值为0.02, p-值为0.011\n",
      "第6个变量和目标的皮尔森相关系数的绝对值为0.05, p-值为0.0\n",
      "第7个变量和目标的皮尔森相关系数的绝对值为0.23, p-值为0.0\n",
      "第8个变量和目标的皮尔森相关系数的绝对值为0.08, p-值为0.0\n"
     ]
    }
   ],
   "source": [
    "# 验算一下我们的结果\n",
    "for idx in range(train_set.shape[1]):\n",
    "    pea_score, p_value = pearsonr(train_set[:,idx], train_y)\n",
    "    print(f\"第{idx + 1}个变量和目标的皮尔森相关系数的绝对值为{round(np.abs(pea_score),2)}, p-值为{round(p_value,3)}\")\n",
    "# 应选择第一个及第七个变量 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Distance Correlation (regression problem) 距离相关系数 (回归问题)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "与皮尔森相关系数类似，距离相关系数也一般被用于衡量两个连续变量之间的相关性。但与皮尔森相关系数不同的是，距离相关系数还衡量了两个变量之间的非线性关联。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "公式:  \n",
    "  \n",
    "首先，计算(n x n)距离矩阵dX。dX中的每一个元素为$dX_{ij}$ 。类似的，我们也可以计算距离矩阵dY，其中dY中的每个元素为$dY_{ij}$。$dX_{ij}$是为观测点i与观测点j之间的距离:     \n",
    "  \n",
    "$dX_{ij} = \\left \\| X_i - X_j \\right \\|$  \n",
    "$dY_{ij} = \\left \\| Y_i - Y_j \\right \\|$\n",
    "   \n",
    "其次，我们计算如下双中心距离并更新距离矩阵。其中，$\\bar{X_i}$为距离矩阵dX的第i行平均值，$\\bar{X_j}$为距离矩阵dX的第j列的平均值， $\\sum_i^N \\sum_j^N dX_{ij}$ 为全局平均值： \n",
    "  \n",
    "$dX_{ij} = dX_{ij} - \\bar{X_i} - \\bar{X_j} + \\frac{1}{N^2} \\sum_i^N \\sum_j^N dX_{ij}$  \n",
    "$dY_{ij} = dY_{ij} - \\bar{Y_i} - \\bar{Y_j} + \\frac{1}{N^2} \\sum_i^N \\sum_j^N dY_{ij}$   \n",
    "  \n",
    "随后，我们便可以算出样本距离协方差及距离方差：  \n",
    "  \n",
    "$dCov^2 (X, Y) = \\frac{1}{N^2} \\sum_{i}^{N} \\sum_{j}^{N} dX_{ij}dY_{ij}$  \n",
    "$dVar^2(X) = Cov^2_D (X, X)$  \n",
    "  \n",
    "最后，距离相关系数$dCor(X,Y)$ 便为如下： \n",
    "  \n",
    "$dCor(X,Y) = \\frac{dCov_D(X, Y)}{\\sqrt{dVar^2(X)}\\sqrt{dVar^2(Y)} }$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:34:59.679022Z",
     "start_time": "2020-03-30T03:26:35.561654Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from dcor import distance_correlation\n",
    "from dcor.independence import distance_covariance_test\n",
    "from sklearn.feature_selection import SelectKBest\n",
    "\n",
    "# 直接载入数据集\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "dataset = fetch_california_housing()\n",
    "X, y = dataset.data, dataset.target # 利用 california_housing 数据集来演示\n",
    "# 此数据集中，X，y均为连续变量，故此满足使用距离相关系数的条件\n",
    "\n",
    "# 选择前15000个观测点作为训练集\n",
    "# 剩下的作为测试集\n",
    "train_set = X[0:15000,:]\n",
    "test_set = X[15000:,]\n",
    "train_y = y[0:15000]\n",
    "\n",
    "# sklearn 中没有直接的函数可以使用\n",
    "# 此处将用 dcor.distance_correlation函数来实现基于距离相关系数的特征过滤\n",
    "# 注意 dcor.distance_correlation 计算的是两个变量之间的相关系数\n",
    "# 因sklearn SelectKBest需要，我们将基于dcor.distance_correlation 重写允许多特征同时输入的函数 udf_dcorr\n",
    "\n",
    "def udf_dcorr(X, y):\n",
    "    # 将会分别计算每一个变量与目标变量的关系\n",
    "    result = np.array([[distance_correlation(x, y), \n",
    "                        distance_covariance_test(x,y)[0]]for x in X.T]) # 包含(距离相关系数, p值) 的列表\n",
    "    return result[:,0], result[:,1]\n",
    "\n",
    "# SelectKBest 将会基于一个判别函数自动选择得分高的变量\n",
    "# 这里的判别函数为距离相关系数\n",
    "selector = SelectKBest(udf_dcorr, k=2) # k => 我们想要选择的变量数\n",
    "selector.fit(train_set, train_y) # 在训练集上训练\n",
    "transformed_train = selector.transform(train_set) # 转换训练集\n",
    "transformed_train.shape #(15000, 2), 其选择了第一个及第三个变量 \n",
    "assert np.array_equal(transformed_train, train_set[:,[0,2]])\n",
    "\n",
    "transformed_test = selector.transform(test_set) # 转换测试集\n",
    "assert np.array_equal(transformed_test, test_set[:,[0,2]]);\n",
    "# 可见对于测试集，其依然选择了第一个及第三个变量 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:47.881323Z",
     "start_time": "2020-03-30T03:34:59.700756Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "第1个变量和目标的距离相关系数为0.66, p-值为1.0\n",
      "第2个变量和目标的距离相关系数为0.07, p-值为1.0\n",
      "第3个变量和目标的距离相关系数为0.31, p-值为1.0\n",
      "第4个变量和目标的距离相关系数为0.12, p-值为1.0\n",
      "第5个变量和目标的距离相关系数为0.08, p-值为1.0\n",
      "第6个变量和目标的距离相关系数为0.29, p-值为1.0\n",
      "第7个变量和目标的距离相关系数为0.25, p-值为1.0\n",
      "第8个变量和目标的距离相关系数为0.19, p-值为1.0\n"
     ]
    }
   ],
   "source": [
    "# 验算一下我们的结果\n",
    "for idx in range(train_set.shape[1]):\n",
    "    d_score = distance_correlation(train_set[:,idx], train_y)\n",
    "    p_value = distance_covariance_test(train_set[:,idx], train_y)[0]\n",
    "    print(f\"第{idx + 1}个变量和目标的距离相关系数为{round(d_score,2)}, p-值为{round(p_value,3)}\")\n",
    "# 应选择第一个及第三个变量 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-11T04:22:20.392522Z",
     "start_time": "2020-03-11T04:22:20.377296Z"
    }
   },
   "source": [
    "#### F-Score (regression problem) F-统计量 (回归问题)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "F统计量（F-Score）用于检验线性回归模型的整体显著性。在sklearn中，其将对每一个变量分别建立一个一元的线性回归模型，然后分别报告每一个对应模型的F统计量。F-统计量的零假设是该线性模型系数不显著，在一元模型中，该统计量能够反映各变量与目标变量之间的线性关系。因此，我们应该选择具有较高F统计量的特征（更有可能拒绝原假设）。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "公式：  \n",
    "  \n",
    "$F = \\frac{(SST - SSR)/(p - 1)}{SSR/(n - p)} =  \\frac{SST - SSR}{SSR/(n - 2)} =  \\frac{R^2}{(1 - R^2)(n - 2)} = \\frac{\\rho ^2}{(1 - \\rho ^2)(n - 2)}$  \n",
    " \n",
    "其中:  \n",
    "\n",
    "$SST = \\sum_{i=1}^{n}(y_i - \\overline{y}) ^2$  \n",
    "  \n",
    "$\\overline{y} = \\frac{1}{n} \\sum_{i=1}^{n}y_i$  \n",
    "  \n",
    "$SSR = \\sum_{i=1}^{n}(\\widehat{y}_i - \\overline{y})^2$  \n",
    "  \n",
    "$\\widehat{y}_i$ 为模型预测值\n",
    "  \n",
    "  \n",
    "SST为总平方和，SSR为回归平方和，p为线性回归自变量数（包括常数项，故在上述的一元线性模型中，p=2），$\\rho$为自变量与因变量的线性相关系数，n为总观测数。因上述线性模型为一元线性模型，故可证$\\rho^2 = R^2$。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:48.061253Z",
     "start_time": "2020-03-30T03:43:47.898644Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.feature_selection import f_regression\n",
    "from sklearn.feature_selection import SelectKBest\n",
    "\n",
    "# 直接载入数据集\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "dataset = fetch_california_housing()\n",
    "X, y = dataset.data, dataset.target # 利用 california_housing 数据集来演示\n",
    "# 此数据集中，X，y均为连续变量，故此满足使用F统计量的条件\n",
    "\n",
    "# 选择前15000个观测点作为训练集\n",
    "# 剩下的作为测试集\n",
    "train_set = X[0:15000,:]\n",
    "test_set = X[15000:,]\n",
    "train_y = y[0:15000]\n",
    "\n",
    "# sklearn 中直接提供了函数用于计算F统计量\n",
    "# SelectKBest 将会基于一个判别函数自动选择得分高的变量\n",
    "# 这里的判别函数为F统计量\n",
    "selector = SelectKBest(f_regression, k=2) # k => 我们想要选择的变量数\n",
    "selector.fit(train_set, train_y) # 在训练集上训练\n",
    "transformed_train = selector.transform(train_set) # 转换训练集\n",
    "transformed_train.shape #(15000, 2), 其选择了第一个及第七个变量 \n",
    "assert np.array_equal(transformed_train, train_set[:,[0,6]]) \n",
    "\n",
    "transformed_test = selector.transform(test_set) # 转换测试集\n",
    "assert np.array_equal(transformed_test, test_set[:,[0,6]]);\n",
    "# 可见对于测试集，其依然选择了第一个及第七个变量 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:48.078788Z",
     "start_time": "2020-03-30T03:43:48.064038Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "第1个变量的F统计量为14111.79, p-值为0.0\n",
      "第2个变量的F统计量为71.99, p-值为0.0\n",
      "第3个变量的F统计量为317.04, p-值为0.0\n",
      "第4个变量的F统计量为23.93, p-值为0.0\n",
      "第5个变量的F统计量为6.54, p-值为0.011\n",
      "第6个变量的F统计量为35.93, p-值为0.0\n",
      "第7个变量的F统计量为846.61, p-值为0.0\n",
      "第8个变量的F统计量为98.06, p-值为0.0\n"
     ]
    }
   ],
   "source": [
    "# 验算一下我们的结果\n",
    "for idx in range(train_set.shape[1]):\n",
    "    score, p_value = f_regression(train_set[:,idx].reshape(-1,1), train_y)\n",
    "    print(f\"第{idx + 1}个变量的F统计量为{round(score[0],2)}, p-值为{round(p_value[0],3)}\")\n",
    "# 故应选择第一个及第七个变量 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Mutual Information (regression problem) 互信息 (回归问题)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "互信息（Mutual Information）衡量变量间的相互依赖性。其本质为熵差，即$H(X) - H(X|Y)，即知道另一个变量信息后混乱的降低程度$。当且仅当两个随机变量独立时MI等于零。MI值越高，两变量之间的相关性则越强。与Pearson相关和F统计量相比，它还捕获了非线性关系。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "公式:  \n",
    "  \n",
    "- 若两个变量均为离散变量:  \n",
    "  \n",
    "    $I(x, y) = H(Y) - H(Y|X) = \\sum_{x\\in \\mathit{X}}  \\sum_{x\\in \\mathit{Y}} \\textit{p}_{(X,Y)}(x,y) \\textrm{log}(\\frac{\\textit{p}_{(X,Y)}(x,y)}{\\textit{p}_{X}(x)\\textit{p}_{Y}(y)})$  \n",
    "\n",
    "    $\\textit{p}_{(X,Y)}(x,y)$ 为x和y的联合概率质量函数 (PMF)， $\\textit{p}_{X}(x)$则为x的联合概率质量函数 (PMF)。  \n",
    "  \n",
    "- 若两个变量均为连续变量:  \n",
    "\n",
    "    $I(X, Y) = H(Y) - H(Y|X) = \\int_X \\int_Y  \\textit{p}_{(X,Y)}(x,y) \\textrm{log}(\\frac{\\textit{p}_{(X,Y)}(x,y)}{\\textit{p}_{X}(x)\\textit{p}_{Y}(y)}) \\, \\, dx dy$  \n",
    "    \n",
    "    $\\textit{p}_{(X,Y)}(x,y)$ 为x和y的联合概率密度函数 (PDF)，$\\textit{p}_{X}(x)$则为x的概率密度函数 (PDF)。连续变量情形下，在实际操作中，往往先对数据离散化分桶，然后逐个桶进行计算。\n",
    "  \n",
    "\n",
    "但是实际上，一种极有可能的情况是，x和y中的一个可能是离散变量，而另一个是连续变量。因此在sklearn中，它基于[1]和[2]中提出的基于k最临近算法的熵估计非参数方法。   \n",
    "   \n",
    "[1] A. Kraskov, H. Stogbauer and P. Grassberger, “Estimating mutual information”. Phys. Rev. E 69, 2004.  \n",
    "[2] B. C. Ross “Mutual Information between Discrete and Continuous Data Sets”. PLoS ONE 9(2), 2014. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:50.738206Z",
     "start_time": "2020-03-30T03:43:48.090754Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.feature_selection import mutual_info_regression\n",
    "from sklearn.feature_selection import SelectKBest\n",
    "\n",
    "# 直接载入数据集\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "dataset = fetch_california_housing()\n",
    "X, y = dataset.data, dataset.target # 利用 california_housing 数据集来演示\n",
    "# 此数据集中，X，y均为连续变量，故此满足使用MI的条件\n",
    "\n",
    "# 选择前15000个观测点作为训练集\n",
    "# 剩下的作为测试集\n",
    "train_set = X[0:15000,:].astype(float)\n",
    "test_set = X[15000:,].astype(float)\n",
    "train_y = y[0:15000].astype(float)\n",
    "\n",
    "# KNN中的临近数是一个非常重要的参数\n",
    "# 故我们重写了一个新的MI计算函数更好的来控制这一参数\n",
    "def udf_MI(X, y):\n",
    "    result = mutual_info_regression(X, y, n_neighbors = 5) # 用户可以输入想要的临近数\n",
    "    return result\n",
    "\n",
    "# SelectKBest 将会基于一个判别函数自动选择得分高的变量\n",
    "# 这里的判别函数为F统计量\n",
    "selector = SelectKBest(udf_MI, k=2) # k => 我们想要选择的变量数\n",
    "selector.fit(train_set, train_y) # 在训练集上训练\n",
    "transformed_train = selector.transform(train_set) # 转换训练集\n",
    "transformed_train.shape #(15000, 2), 其选择了第一个及第八个变量\n",
    "assert np.array_equal(transformed_train, train_set[:,[0,7]])\n",
    "\n",
    "transformed_test = selector.transform(test_set) # 转换测试集\n",
    "assert np.array_equal(transformed_test, test_set[:,[0,7]]);\n",
    "# 可见对于测试集，其依然选择了第一个及第八个变量"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:53.172117Z",
     "start_time": "2020-03-30T03:43:50.742937Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "第1个变量与因变量的互信息为0.38\n",
      "第2个变量与因变量的互信息为0.03\n",
      "第3个变量与因变量的互信息为0.1\n",
      "第4个变量与因变量的互信息为0.03\n",
      "第5个变量与因变量的互信息为0.02\n",
      "第6个变量与因变量的互信息为0.09\n",
      "第7个变量与因变量的互信息为0.37\n",
      "第8个变量与因变量的互信息为0.46\n"
     ]
    }
   ],
   "source": [
    "# 验算上述结果\n",
    "for idx in range(train_set.shape[1]):\n",
    "    score = mutual_info_regression(train_set[:,idx].reshape(-1,1), train_y, n_neighbors = 5)\n",
    "    print(f\"第{idx + 1}个变量与因变量的互信息为{round(score[0],2)}\")\n",
    "# 故应选择第一个及第八个变量"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Chi-squared Statistics (classification problem) 卡方统计量 (分类问题)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "卡方统计量主要用于衡量两个类别特征之间的相关性。sklearn提供了chi2函数用于计算卡方统计量。其输入的特征变量必须为布尔值或频率（故对于类别变量应考虑独热编码）。卡方统计量的零假设为两个变量是独立的，因为卡方统计量值越高，则两个类别变量的相关性越强。因此，我们应该选择具有较高卡方统计量的特征。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "公式:  \n",
    "  \n",
    "$\\chi^2 = \\sum_{i=1}^{r} \\sum_{j=1}^{c} \\frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} = n \\sum_{i,j} p_ip_j(\\frac{\\frac{O_{i,j}}{n} - p_i p_j}{p_i p_j})^2$  \n",
    "  \n",
    "其中，$O_{i,j}$为在变量X上具有i-th类别值且在变量Y上具有j-th类别值的实际观测点计数。$E_{i,j}$ 为利用概率估计的应在在变量X上具有i-th类别值且在变量Y上具有j-th类别值的观测点数量。n为总观测数，$p_i$为在变量X上具有i-th类别值的概率，$p_j$为在变量Y上具有j-th类别值的概率。  \n",
    "  \n",
    "**值得注意的是，通过解析源代码，我们发现在sklearn中利用chi2计算出来的卡方统计量并不是统计意义上的卡方统计量**。当输入变量为布尔变量时，chi2计算值为该布尔变量为True时候的卡方统计量（我们将会在下文举例说明）。这样的优势是，独热编码生成的所有布尔值变量的chi2值之和将等于原始变量统计意义上的卡方统计量。  \n",
    "  \n",
    "举个简单的例子，假设一个变量I有0，1，2两种可能的值，则独特编码后一共会产生3个新的布尔值变量。这三个布尔值变量的chi2计算出来的值之和，将等于变量I与因变量直接计算得出的统计意义上的卡方统计量。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### 解析sklearn中chi2的计算"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:53.264829Z",
     "start_time": "2020-03-30T03:43:53.173619Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Type</th>\n",
       "      <th>Output</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>J</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>J</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>J</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>B</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>B</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>B</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>C</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>C</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>C</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>C</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>C</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Type  Output\n",
       "0     J       0\n",
       "1     J       1\n",
       "2     J       0\n",
       "3     B       2\n",
       "4     B       0\n",
       "5     B       1\n",
       "6     C       0\n",
       "7     C       0\n",
       "8     C       1\n",
       "9     C       2\n",
       "10    C       2"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 首先，随机生成一个数据集\n",
    "import pandas as pd\n",
    "sample_dict = {'Type': ['J','J','J',\n",
    "                        'B','B','B',\n",
    "                        'C','C','C','C','C'], \n",
    "               'Output': [0, 1, 0, \n",
    "                          2, 0, 1,  \n",
    "                          0, 0, 1, 2, 2,]}\n",
    "sample_raw = pd.DataFrame(sample_dict)\n",
    "sample_raw #原始数据，Output是我们的目标变量，Type为类别变量"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:53.297338Z",
     "start_time": "2020-03-30T03:43:53.266661Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(array([0.17777778, 0.42666667, 1.15555556]),\n",
       " array([0.91494723, 0.8078868 , 0.56114397]))"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 下面利用独热编码生成布尔变量，并利用sklearn计算每一个布尔变量的chi2值\n",
    "sample = pd.get_dummies(sample_raw)\n",
    "from sklearn.feature_selection import chi2\n",
    "chi2(sample.values[:,[1,2,3]],sample.values[:,[0]])\n",
    "# 第一行为每一个布尔变量的chi2值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:53.314068Z",
     "start_time": "2020-03-30T03:43:53.298829Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Type</th>\n",
       "      <th>Output</th>\n",
       "      <th>Count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>B</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>B</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>B</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>C</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>C</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>C</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>J</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>J</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Type  Output  Count\n",
       "0    B       0      1\n",
       "1    B       1      1\n",
       "2    B       2      1\n",
       "3    C       0      2\n",
       "4    C       1      1\n",
       "5    C       2      2\n",
       "6    J       0      2\n",
       "7    J       1      1"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 下面直接计算原始变量Type与output统计学意义上的卡方统计量\n",
    "# 首先，先统计每一个类别下出现的观测数，用于创建列联表\n",
    "obs_df = sample_raw.groupby(['Type','Output']).size().reset_index()\n",
    "obs_df.columns = ['Type','Output','Count']\n",
    "obs_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "即列联表（contingency table）为：  \n",
    "\n",
    "| Type/Output | 0 | 1 | 2 |\n",
    "|------|------|------|------|\n",
    "|  B  | 1 | 1 | 1 |\n",
    "|  C  | 2 | 1 | 2 |\n",
    "|  J  | 2 | 1 | 0 |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:53.321849Z",
     "start_time": "2020-03-30T03:43:53.315439Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1.7600000000000002,\n",
       " 0.779791873961373,\n",
       " 4,\n",
       " array([[1.36363636, 0.81818182, 0.81818182],\n",
       "        [2.27272727, 1.36363636, 1.36363636],\n",
       "        [1.36363636, 0.81818182, 0.81818182]]))"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from scipy.stats import chi2_contingency\n",
    "obs = np.array([[1, 1, 1], [2, 1, 2],[2, 1, 0]])\n",
    "chi2_contingency(obs) # 第一个值即为变量Type与output统计学意义上的卡方统计量"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:53.330007Z",
     "start_time": "2020-03-30T03:43:53.323857Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 而chi2函数算出来的布尔值之和为即为原始变量的统计意义上的卡方统计量\n",
    "chi2(sample.values[:,[1,2,3]],sample.values[:,[0]])[0].sum() == chi2_contingency(obs)[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:53.336779Z",
     "start_time": "2020-03-30T03:43:53.331453Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Power_divergenceResult(statistic=0.17777777777777778, pvalue=0.9149472287300311)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 那么sklearn中的chi2是如何计算的呢？\n",
    "# 不妨以第一个生成的布尔值为例，即Type为B\n",
    "# chi2出来的值为0.17777778\n",
    "# 而这与利用scipy以下代码计算出的计算一致\n",
    "from scipy.stats import chisquare\n",
    "f_exp = np.array([5/11, 3/11, 3/11]) * 3 # 预期频数为 output的先验概率 * Type为B 观测数\n",
    "chisquare([1,1,1], f_exp=f_exp) # [1,1,1] 即Type为B 的观测实际频数\n",
    "# 即sklearn 中的chi2 仅考虑了Type为B情形下的列连表"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### 如何利用sklearn 来进行特征选择"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:53.350699Z",
     "start_time": "2020-03-30T03:43:53.338216Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.feature_selection import chi2\n",
    "from sklearn.feature_selection import SelectKBest\n",
    "\n",
    "# 直接载入数据集\n",
    "from sklearn.datasets import load_iris # 利用iris数据作为演示数据集\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "# 此数据集中，X为连续变量，y为类别变量\n",
    "# 不满足chi2的使用条件\n",
    "\n",
    "# 将连续变量变为布尔值变量以满足chi2使用条件\n",
    "# 不妨利用其是否大于均值来生成布尔值（仅作为演示用）\n",
    "X = X > X.mean(0)\n",
    "\n",
    "# iris 数据集使用前需要被打乱顺序\n",
    "np.random.seed(1234)\n",
    "idx = np.random.permutation(len(X))\n",
    "X = X[idx]\n",
    "y = y[idx]\n",
    "\n",
    "# 选择前100个观测点作为训练集\n",
    "# 剩下的作为测试集\n",
    "train_set = X[0:100,:]\n",
    "test_set = X[100:,]\n",
    "train_y = y[0:100]\n",
    "\n",
    "# sklearn 中直接提供了函数用于计算卡方统计量\n",
    "# SelectKBest 将会基于一个判别函数自动选择得分高的变量\n",
    "# 这里的判别函数为F统计量\n",
    "selector = SelectKBest(chi2, k=2) # k => 我们想要选择的变量数\n",
    "selector.fit(train_set, train_y) # 在训练集上训练\n",
    "transformed_train = selector.transform(train_set) # 转换训练集\n",
    "transformed_train.shape #(100, 2), 其选择了第三个及第四个变量 \n",
    "assert np.array_equal(transformed_train, train_set[:,[2,3]]) \n",
    "\n",
    "transformed_test = selector.transform(test_set) # 转换测试集\n",
    "assert np.array_equal(transformed_test, test_set[:,[2,3]]);\n",
    "# 可见对于测试集，其依然选择了第三个及第四个变量 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:53.358715Z",
     "start_time": "2020-03-30T03:43:53.351973Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "第1个变量与因变量的卡方统计量为29.69，p值为0.0\n",
      "第2个变量与因变量的卡方统计量为19.42，p值为0.0\n",
      "第3个变量与因变量的卡方统计量为31.97，p值为0.0\n",
      "第4个变量与因变量的卡方统计量为31.71，p值为0.0\n"
     ]
    }
   ],
   "source": [
    "# 验证上述结果\n",
    "for idx in range(train_set.shape[1]):\n",
    "    score, p_value = chi2(train_set[:,idx].reshape(-1,1), train_y)\n",
    "    print(f\"第{idx + 1}个变量与因变量的卡方统计量为{round(score[0],2)}，p值为{round(p_value[0],3)}\")\n",
    "# 故应选择第三个及第四个变量 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### F-Score (classification problem) F-统计量 (分类问题)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在分类机器学习问题中，若变量特征为类别特征，则我们可以使用独热编码配合上述chi2方法选择最重要的特征。但若特征为连续变量，则我们可以使用ANOVA-F值。ANOVA F统计量的零假设是若按目标变量（类别）分组，则连续变量的总体均值是相同的。故我们应选择具有高ANOVA-F统计量的连续变量，因为这些连续变量与目标变量的关联性强。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "公式：  \n",
    "  \n",
    "$F = \\frac{MSB}{MSE} = \\frac{ \\frac{SS(between)}{m-1}}{ \\frac{SS(error)}{n-m}}$   \n",
    "  \n",
    "其中，SS(between)为组间的平方和，即组均值和总体均值之间的平方和。 SS(error)是组内的平方和，即数据与组均值之间的平方和。 m是目标变量的总类别数，n是观测数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:53.368868Z",
     "start_time": "2020-03-30T03:43:53.360087Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.feature_selection import f_classif\n",
    "from sklearn.feature_selection import SelectKBest\n",
    "\n",
    "# 直接载入数据集\n",
    "from sklearn.datasets import load_iris # 利用iris数据作为演示数据集\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "# 此数据集中，X为连续变量，y为类别变量\n",
    "# 满足ANOVA-F的使用条件\n",
    "\n",
    "# iris 数据集使用前需要被打乱顺序\n",
    "np.random.seed(1234)\n",
    "idx = np.random.permutation(len(X))\n",
    "X = X[idx]\n",
    "y = y[idx]\n",
    "\n",
    "# 选择前100个观测点作为训练集\n",
    "# 剩下的作为测试集\n",
    "train_set = X[0:100,:]\n",
    "test_set = X[100:,]\n",
    "train_y = y[0:100]\n",
    "\n",
    "# sklearn 中直接提供了函数用于计算ANOVA-F\n",
    "# SelectKBest 将会基于一个判别函数自动选择得分高的变量\n",
    "# 这里的判别函数为F统计量\n",
    "selector = SelectKBest(f_classif, k=2) # k => 我们想要选择的变量数\n",
    "selector.fit(train_set, train_y) # 在训练集上训练\n",
    "transformed_train = selector.transform(train_set) # 转换训练集\n",
    "transformed_train.shape #(100, 2), 其选择了第三个及第四个变量 \n",
    "assert np.array_equal(transformed_train, train_set[:,[2,3]]) \n",
    "\n",
    "transformed_test = selector.transform(test_set) # 转换测试集\n",
    "assert np.array_equal(transformed_test, test_set[:,[2,3]]);\n",
    "# 可见对于测试集，其依然选择了第三个及第四个变量 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:53.375786Z",
     "start_time": "2020-03-30T03:43:53.370057Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "第1个变量与因变量的ANOVA-F统计量为91.39，p值为0.0\n",
      "第2个变量与因变量的ANOVA-F统计量为33.18，p值为0.0\n",
      "第3个变量与因变量的ANOVA-F统计量为733.94，p值为0.0\n",
      "第4个变量与因变量的ANOVA-F统计量为608.95，p值为0.0\n"
     ]
    }
   ],
   "source": [
    "# 验证上述结果\n",
    "for idx in range(train_set.shape[1]):\n",
    "    score, p_value = f_classif(train_set[:,idx].reshape(-1,1), train_y)\n",
    "    print(f\"第{idx + 1}个变量与因变量的ANOVA-F统计量为{round(score[0],2)}，p值为{round(p_value[0],3)}\")\n",
    "# 故应选择第三个及第四个变量 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Mutual Information (classification problem) 互信息 (分类问题)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "【与1.1.1.5一样】互信息（Mutual Information）衡量变量间的相互依赖性。其本质为熵差，即$H(X) - H(X|Y)，即知道另一个变量信息后混乱的降低程度$。当且仅当两个随机变量独立时MI等于零。MI值越高，两变量之间的相关性则越强。与Pearson相关和F统计量相比，它还捕获了非线性关系。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "公式:  \n",
    "  \n",
    "- 若两个变量均为离散变量:  \n",
    "  \n",
    "    $I(x, y) = H(Y) - H(Y|X) = \\sum_{x\\in \\mathit{X}}  \\sum_{x\\in \\mathit{Y}} \\textit{p}_{(X,Y)}(x,y) \\textrm{log}(\\frac{\\textit{p}_{(X,Y)}(x,y)}{\\textit{p}_{X}(x)\\textit{p}_{Y}(y)})$  \n",
    "\n",
    "    $\\textit{p}_{(X,Y)}(x,y)$ 为x和y的联合概率质量函数 (PMF)， $\\textit{p}_{X}(x)$则为x的的联合概率质量函数 (PMF)。  \n",
    "  \n",
    "- 若两个变量均为连续变量:  \n",
    "\n",
    "    $I(X, Y) = H(Y) - H(Y|X) = \\int_X \\int_Y  \\textit{p}_{(X,Y)}(x,y) \\textrm{log}(\\frac{\\textit{p}_{(X,Y)}(x,y)}{\\textit{p}_{X}(x)\\textit{p}_{Y}(y)}) \\, \\, dx dy$  \n",
    "    \n",
    "    $\\textit{p}_{(X,Y)}(x,y)$ 为x和y的联合概率密度函数 (PDF)，$\\textit{p}_{X}(x)$则为x的的联合概率密度函数 (PDF)。连续变量情形下，在实际操作中，往往先对数据离散化分桶，然后逐个桶进行计算。\n",
    "  \n",
    "\n",
    "但是实际上，一种极有可能的情况是，x和y中的一个可能是离散变量，而另一个是连续变量。因此在sklearn中，它基于[1]和[2]中提出的基于k最临近算法的熵估计非参数方法。   \n",
    "   \n",
    "[1] A. Kraskov, H. Stogbauer and P. Grassberger, “Estimating mutual information”. Phys. Rev. E 69, 2004.  \n",
    "[2] B. C. Ross “Mutual Information between Discrete and Continuous Data Sets”. PLoS ONE 9(2), 2014. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:53.397199Z",
     "start_time": "2020-03-30T03:43:53.377347Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.feature_selection import mutual_info_classif\n",
    "from sklearn.feature_selection import SelectKBest\n",
    "\n",
    "# 直接载入数据集\n",
    "from sklearn.datasets import load_iris # 利用iris数据作为演示数据集\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "# 此数据集中，X为连续变量，y为类别变量\n",
    "# 满足MI的使用条件\n",
    "\n",
    "# iris 数据集使用前需要被打乱顺序\n",
    "np.random.seed(1234)\n",
    "idx = np.random.permutation(len(X))\n",
    "X = X[idx]\n",
    "y = y[idx]\n",
    "\n",
    "# 选择前100个观测点作为训练集\n",
    "# 剩下的作为测试集\n",
    "train_set = X[0:100,:]\n",
    "test_set = X[100:,]\n",
    "train_y = y[0:100]\n",
    "\n",
    "# KNN中的临近数是一个非常重要的参数\n",
    "# 故我们重写了一个新的MI计算函数更好的来控制这一参数\n",
    "def udf_MI(X, y):\n",
    "    result = mutual_info_classif(X, y, n_neighbors = 5) # 用户可以输入想要的临近数\n",
    "    return result\n",
    "\n",
    "# SelectKBest 将会基于一个判别函数自动选择得分高的变量\n",
    "# 这里的判别函数为F统计量\n",
    "selector = SelectKBest(udf_MI, k=2) # k => 我们想要选择的变量数\n",
    "selector.fit(train_set, train_y) # 在训练集上训练\n",
    "transformed_train = selector.transform(train_set) # 转换训练集\n",
    "transformed_train.shape #(100, 2), 其选择了第三个及第四个变量 \n",
    "assert np.array_equal(transformed_train, train_set[:,[2,3]]) \n",
    "\n",
    "transformed_test = selector.transform(test_set) # 转换测试集\n",
    "assert np.array_equal(transformed_test, test_set[:,[2,3]]);\n",
    "# 可见对于测试集，其依然选择了第三个及第四个变量 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:53.414839Z",
     "start_time": "2020-03-30T03:43:53.398716Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "第1个变量与因变量的互信息为0.56\n",
      "第2个变量与因变量的互信息为0.28\n",
      "第3个变量与因变量的互信息为0.99\n",
      "第4个变量与因变量的互信息为1.02\n"
     ]
    }
   ],
   "source": [
    "# 验算上述结果\n",
    "for idx in range(train_set.shape[1]):\n",
    "    score = mutual_info_classif(train_set[:,idx].reshape(-1,1), train_y, n_neighbors = 5)\n",
    "    print(f\"第{idx + 1}个变量与因变量的互信息为{round(score[0],2)}\")\n",
    "# 故应选择第三个及第四个变量"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Multivariate Filter Methods 多元特征过滤"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "单变量特征过滤仅考虑了每一变量与目标变量之间的关系，而忽视了变量间的相关性。多元变量过滤则解决了这一问题，其考虑了变量之间的相互关系，基于整个特征空间选择最佳特征。因此多元特征过滤在删除冗余变量方面表现更好。这里利用亚利桑那州立大学开发的[skfeature](https://github.com/jundongl/scikit-feature)模块来进行多元特征过滤。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Max-Relevance Min-Redundancy (mRMR) 最大相关最小冗余"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "最大相关最小冗余试图寻找一个与目标变量有较高相关性（例如：MI）的变量子集，同时这个子集中的变量还应具有较低的相互关联性。通过解析源代码，我们发现，skfeature中最大相关最小冗余方法仅适用于分类问题中的离散特征，因为它计算过程中使用的是计算离散情形下的互信息 (MI)的公式。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "公式：  \n",
    "假设数据集共包含m个特征，则基于mRMR公式，其中第n个特征的重要性可被表示为：\n",
    "\n",
    "$f^{mRMR}(X_i) = I(Y, X_i) - \\frac{1}{|S|}\\sum_{X_s \\in S} I(X_s, X_i)$  \n",
    "  \n",
    "$I(Y, X_i)$为变量$X_i$与目标变量Y的互信息。$\\frac{1}{|S|}\\sum_{X_s \\in S} I(X_s, X_i)$为变量$X_i$与现有变量子集中所有变量的互信息的平均值。      \n",
    "  \n",
    "mRMR其实是一个逐步（step-wise）的方法，在mRMR特征选择过程的每一步中，具有最高特征重要性$f^{mRMR}(X_i)$的变量$X_i, (X_i \\notin  S)$将会被加入子集中，直至子集中的变量数达到用户要求。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:53.437413Z",
     "start_time": "2020-03-30T03:43:53.416458Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from skfeature.function.information_theoretical_based import MRMR\n",
    "from sklearn.datasets import load_iris  # 利用iris数据作为演示数据集\n",
    "\n",
    "# 载入数据集\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "\n",
    "# iris 数据集使用前需要被打乱顺序\n",
    "np.random.seed(1234)\n",
    "idx = np.random.permutation(len(X))\n",
    "X = X[idx]\n",
    "y = y[idx]\n",
    "\n",
    "# 选择前100个观测点作为训练集\n",
    "# 剩下的50个观测点作为测试集\n",
    "# 由于skfeature中的mRMR仅适用于离散变量\n",
    "# 因此我们通过将float转换为int而把所有连续变量转换为离散变量\n",
    "# 此转换仅用于演示目的\n",
    "\n",
    "train_set = X[0:100,:].astype(int)\n",
    "test_set = X[100:,].astype(int)\n",
    "train_y = y[0:100].astype(int)\n",
    "\n",
    "feature_index,_,_ = MRMR.mrmr(train_set, train_y, n_selected_features=2) # 在训练集上训练\n",
    "transformed_train = train_set[:,feature_index] # 转换训练集\n",
    "assert np.array_equal(transformed_train, train_set[:,[2,3]])  # 其选择了第三个及第四个变量 \n",
    "\n",
    "transformed_test = test_set[:,feature_index] # 转换测试集\n",
    "assert np.array_equal(transformed_test, test_set[:,[2,3]]) # 其选择了第三个及第四个变量 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Correlation-based Feature Selection (CFS) 基于相关性的特征选择"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "与mRMR类似，基于相关性的特征选择（CFS）也基于一个类似的假设：一个好的特征子集应包含与目标高度相关且彼此不相关的特征。 通过解析源代码，我们发现，skfeature中CFS的实现也仅适用于分类问题中的离散特征。因为其使用的是离散情形下的对称不确定性（symmetrical uncertainty）作为变量间相关性的衡量标准。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "公式：  \n",
    "  \n",
    "$Merit_S =\\frac{ \\sum_{i=1}^{k} SU(X_i, y)}{\\sqrt{k + \\sum_{i=1}^{k} \\sum_{j=1}^{k} SU(X_i, X_j) }}, \\, \\, \\, X_i \\in S^*$  \n",
    "  \n",
    "$SU(X, Y) = 2 * \\frac{H(X) + H(Y) - H(X|Y)}{H(X) + H(Y)}$  \n",
    "  \n",
    "$S$为特征子集. 我们需要寻找最大化$Merit_S$的最优子集$S^*$。  \n",
    "$SU(X_i, y)$为离散变量$X_i$与目标变量$y$间的对称不确定性（SU）。     \n",
    "$SU(X_i, X_j)$为离散变量$X_i$与离散变量$X_j$间的对称不确定性（SU）。  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:53.461564Z",
     "start_time": "2020-03-30T03:43:53.443040Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from skfeature.function.statistical_based import CFS\n",
    "from sklearn.datasets import load_iris  # 利用iris数据作为演示数据集\n",
    "\n",
    "# 载入数据集\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "\n",
    "# iris 数据集使用前需要被打乱顺序\n",
    "np.random.seed(1234)\n",
    "idx = np.random.permutation(len(X))\n",
    "X = X[idx]\n",
    "y = y[idx]\n",
    "\n",
    "# 选择前100个观测点作为训练集\n",
    "# 剩下的50个观测点作为测试集\n",
    "# 由于skfeature中的CFS仅适用于离散变量\n",
    "# 因此我们通过将float转换为int而把所有连续变量转换为离散变量\n",
    "# 此转换仅用于演示目的\n",
    "\n",
    "train_set = X[0:100,:].astype(int)\n",
    "test_set = X[100:,].astype(int)\n",
    "train_y = y[0:100].astype(int)\n",
    "\n",
    "num_feature = 2 # 从原数据集中选择两个变量\n",
    "\n",
    "feature_index = CFS.cfs(train_set, train_y) # 在训练集上训练\n",
    "transformed_train = train_set[:,feature_index[0:num_feature]] # 转换训练集\n",
    "assert np.array_equal(transformed_train, train_set[:,[3,2]])  # 其选择了第三个及第四个变量 \n",
    "\n",
    "transformed_test = test_set[:,feature_index[0:num_feature]] # 转换测试集\n",
    "assert np.array_equal(transformed_test, test_set[:,[3,2]]) # 其选择了第三个及第四个变量 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Fast Correlation-based Filter (FCBF) 基于相关性的快速特征选择"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "相比于CFS，FCBS能够更加高效的筛选变量。其同样为逐步（step-wise）的方法，具体步骤与mRMR非常类似，但FCBS使用对称不确定性（SU）衡量变量间的关联性。FCBF首先剔除与目标变量具有较低SU值的变量，并对剩下的变量按与目标变量的SU值从最高到最低排序，然后逐一删除冗余特征。与mRMR，CFS相似，在skfeature中实现的FCBF仅适用于具有离散变量的分类问题。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "公式：  \n",
    "  \n",
    "$SU(X, Y) = 2 * \\frac{H(X) + H(Y) - H(X|Y)}{H(X) + H(Y)}$   \n",
    "  \n",
    "步骤：  \n",
    "  \n",
    "1)计算每个特征变量$X_i$与目标变量$y$之间的相关性$SU(X_i,y)$  \n",
    "2)仅保留$SU(X_i,y)$大于一定阈值$\\sigma$的特征变量，组成候选列表$S_{list}$  \n",
    "3)按照$SU(X_i,y)$值的大小对$S_{list}$中的变量从大到小排序  \n",
    "4)按顺序依次计算每一个特征$X_i$与候选列表$S_{list}$中顺序靠后的每一个特征$X_j$的相关SU值$SU(X_i,X_j)$  \n",
    "5)若$SU(X_i,X_j)$大于$SU(X_j,y)$，则从候选列表$S_{list}$ 中删除$X_j$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:53.474859Z",
     "start_time": "2020-03-30T03:43:53.463397Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from skfeature.function.information_theoretical_based import FCBF\n",
    "from sklearn.datasets import load_iris  # 利用iris数据作为演示数据集\n",
    "\n",
    "# 载入数据集\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "\n",
    "# iris 数据集使用前需要被打乱顺序\n",
    "np.random.seed(1234)\n",
    "idx = np.random.permutation(len(X))\n",
    "X = X[idx]\n",
    "y = y[idx]\n",
    "\n",
    "# 选择前100个观测点作为训练集\n",
    "# 剩下的50个观测点作为测试集\n",
    "# 由于skfeature中的FCFS仅适用于离散变量\n",
    "# 因此我们通过将float转换为int而把所有连续变量转换为离散变量\n",
    "# 此转换仅用于演示目的\n",
    "\n",
    "train_set = X[0:100,:].astype(int)\n",
    "test_set = X[100:,].astype(int)\n",
    "train_y = y[0:100].astype(int)\n",
    "\n",
    "num_feature = 2 # 从原数据集中选择两个变量\n",
    "\n",
    "feature_index = FCBF.fcbf(train_set, train_y, n_selected_features = num_feature)[0] # 在训练集上训练\n",
    "transformed_train = train_set[:,feature_index[0:num_feature]] # 转换训练集\n",
    "assert np.array_equal(transformed_train, train_set[:,[3, 2]])  # 其选择了第三个及第四个变量 \n",
    "# 这是由于其他变量目标变量𝑦之间的相关性过低造成的\n",
    "\n",
    "transformed_test = test_set[:,feature_index[0:num_feature]] # 转换测试集\n",
    "assert np.array_equal(transformed_test, test_set[:,[3, 2]]) # # 其选择了第三个及第四个变量 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### ReliefF"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ReliefF方法是一种基于Relief方法的特征加权算法。在Relief方法中，其根据特征与目标变量的相关性强弱（二分类）给变量分配权重，并删除权重低于特定阈值的特征。其将相关性定义为变量区分邻近观测点的能力。\n",
    "\n",
    "具体来说，在每一步中，Relief方法都会从训练集中随机选择一个观测点S，然后找到具有相同目标标签的S的最近邻观测点，称为NearHit。它还将找到具有不同目标标签的S的最近邻观测点，称为NearMiss。然后根据以下规则更新每个功能的权重：\n",
    "\n",
    "1）若观测点S在某变量上与NearHit的距离大于与NearMiss的距离，则该变量的权重将增加，因为变量有助于区分最邻近情形下的目标标签。2）相反，若观测点S在某变量上与NearHit的距离小于与NearMiss的距离，则该变量的权重会降低。\n",
    "\n",
    "将上述过程重复m次，最后我们会获得每个特征变量的平均权重。特征变量的权重越大，则特征的分类能力越强，越应该被留在最终的特征子集中。\n",
    "\n",
    "在ReliefF中，其修改了权重更新的方式，因此ReliefF方法可以被应用于多类分类问题。另外，它随机采样K个最近的观测点而不是一个。\n",
    "\n",
    "在skfeature中实现的ReliefF可用于分类问题中的连续特征或二元类别特征，因其使用的是L1范数来衡量差异。针对非二元特征，我们可以先将其独热编码，再使用ReliefF方法。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "公式：  \n",
    "\n",
    "Formula:  \n",
    "  \n",
    "$W(X_i) = W(X_i) - \\frac{\\sum_{i=1}^{k} diff(X_i, S, H_j) }{m*k} + \\frac{\\sum_{C\\notin class(S)} [\\frac{p(C)}{1-P(class(S))} \\sum_{i=1}^{k} diff(X_i, S, M_j(C)) ]}{m*k}$  \n",
    "\n",
    "$diff(X_i, R_1, R_2) = \\left\\{\\begin{matrix}\n",
    "\\frac{|R_1(X_i) - R_2(X_i)|}{max(X_i)- min(X_i)} & 若X_i\\ 为连续变量\\\\ \n",
    " 0 & 若X_i为类别变量且\\ R_1(X_i) =R_2(X_i)\\\\ \n",
    " 1 & 若X_i为类别变量且 \\ R_1(X_i) \\neq R_2(X_i)\n",
    "\\end{matrix}\\right.$ \n",
    "  \n",
    "$R_1$ and $R_2$ 为任意两个观测点。$X_i$为某一特征变量. S为我们选择的观测点. $H_j$为第j个NearHit，$M_j$为第j个NearMiss. C为与我们所选的观测点不同的其他目标类别标签。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:53.510692Z",
     "start_time": "2020-03-30T03:43:53.476404Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from skfeature.function.similarity_based import reliefF\n",
    "from sklearn.datasets import load_iris  # 利用iris数据作为演示数据集\n",
    "\n",
    "# 载入数据集\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "\n",
    "# iris 数据集使用前需要被打乱顺序\n",
    "np.random.seed(1234)\n",
    "idx = np.random.permutation(len(X))\n",
    "X = X[idx]\n",
    "y = y[idx]\n",
    "\n",
    "# 选择前100个观测点作为训练集\n",
    "# 剩下的50个观测点作为测试集\n",
    "# skfeature中的reliefF直接适用于连续变量\n",
    "\n",
    "train_set = X[0:100,:]\n",
    "test_set = X[100:,]\n",
    "train_y = y[0:100]\n",
    "\n",
    "num_feature = 2 # 从原数据集中选择两个变量\n",
    "\n",
    "score = reliefF.reliefF(train_set, train_y) # 计算每一个变量的权重\n",
    "feature_index = reliefF.feature_ranking(score) # 依据权重选择变量\n",
    "transformed_train = train_set[:,feature_index[0:num_feature]] # 转换训练集\n",
    "assert np.array_equal(transformed_train, train_set[:,[2, 3]])  # 其选择了第三个及第四个变量 \n",
    "\n",
    "transformed_test = test_set[:,feature_index[0:num_feature]] # 转换测试集\n",
    "assert np.array_equal(transformed_test, test_set[:,[2, 3]]) # 其选择了第三个及第四个变量 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Spectral Feature Selection (SPEC) 基于谱图的特征选择"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "基于谱图的特征选择(SPEC)方法是基于谱图理论的无监督方法。其首先建立变量相似度集合S，并建立其图表示。然后，其根据构造图的频谱评估特征。由于在skfeature中实现的SPEC方基于RBF（高斯）内核建立相似集，因此其可用于分类问题中的连续特征或二元类别特征。针对非二元特征，我们可以先将其独热编码，再使用ReliefF方法。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:53.539878Z",
     "start_time": "2020-03-30T03:43:53.512125Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from skfeature.function.similarity_based import SPEC\n",
    "from sklearn.datasets import load_iris  # 利用iris数据作为演示数据集\n",
    "\n",
    "# 载入数据集\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "\n",
    "# iris 数据集使用前需要被打乱顺序\n",
    "np.random.seed(1234)\n",
    "idx = np.random.permutation(len(X))\n",
    "X = X[idx]\n",
    "y = y[idx]\n",
    "\n",
    "# 选择前100个观测点作为训练集\n",
    "# 剩下的50个观测点作为测试集\n",
    "# skfeature中的SEPC方法直接适用于连续变量\n",
    "\n",
    "train_set = X[0:100,:]\n",
    "test_set = X[100:,]\n",
    "train_y = y[0:100]\n",
    "\n",
    "num_feature = 2 # 从原数据集中选择两个变量\n",
    "\n",
    "score = SPEC.spec(train_set) # 计算每一个变量的得分\n",
    "feature_index = SPEC.feature_ranking(score) #依据变量得分选择变量\n",
    "transformed_train = train_set[:,feature_index[0:num_feature]] # 转换训练集\n",
    "assert np.array_equal(transformed_train, train_set[:,[1, 0]])  # 其选择了第一个及第二个变量 \n",
    "\n",
    "transformed_test = test_set[:,feature_index[0:num_feature]] # 转换测试集\n",
    "assert np.array_equal(transformed_test, test_set[:,[1, 0]]) # 其选择了第一个及第二个变量 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Wrapper Methods 封装方法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "封装方法将特征选择问题视作搜索问题，即其目标为从特征子集集合中搜索出一个最佳的子集，而这一子集在模型中表现最佳。在每一步中，其在特征子集上训练模型，然后对其进行评估，并在下一步继续调整特征子集,重新训练评估，直到找到最佳子集或达到最大迭代次数为止。穷尽搜索在封装方法中为NP-Hard，故人们提出了一些方法来降低封装方法所需要的迭代次数，以便可以在有限的时间内达到一个较好的效果。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![image](../images/封装法工作流.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Deterministic Algorithms 确定性算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在不考虑模型随机性的情况下，给定相同的数据输入，确定性算法将始终输出相同的最优特征子集。顺序向前选择（SFS），顺序向后选择（SBS）均为确定性算法。\n",
    "\n",
    "顺序向前选择（SFS）方法将从最优单变量模型开始，然后在迭代中，其会在上一步变量子集的基础上，以穷举的方法在现有变量子集中增加一个新变量，使得新增一个变量后的变量子集可以获得最大的模型表现提升。迭代将持续直到所选变量的数量满足要求为止。\n",
    "\n",
    "顺序向后选择（SBS）则从适合一个包含所有变量的模型开始，然后在迭代中，其会在上一步变量子集的基础上，以穷举的方法在现有变量子集中删除一个对模型负影响最低的变量，直到所选特征的数量满足要求为止。\n",
    "\n",
    "但是顺序向前选择（SFS）方法和顺序向后选择（SBS）均为逐步（step-wise）的方法，都可能会陷入局部最优状态。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Recursive Feature Elimination (SBS) 递归式特征消除"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在sklearn中，它仅实现递归特征消除（SBS）方法。其提供了两个函数来实现这一方法，一个是RFE，另一个是RFECV。与RFE函数相比，REFCV使用交叉验证的结果来选择最优的特征数量，而在RFE中，要选择的特征数量由用户预定义。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:56.695018Z",
     "start_time": "2020-03-30T03:43:53.541618Z"
    }
   },
   "outputs": [],
   "source": [
    "# RFE函数 演示\n",
    "import numpy as np\n",
    "from sklearn.feature_selection import RFE\n",
    "\n",
    "# 直接载入数据集\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "dataset = fetch_california_housing()\n",
    "X, y = dataset.data, dataset.target # 利用 california_housing 数据集来演示\n",
    "\n",
    "# 选择前15000个观测点作为训练集\n",
    "# 剩下的作为测试集\n",
    "train_set = X[0:15000,:]\n",
    "test_set = X[15000:,]\n",
    "train_y = y[0:15000]\n",
    "\n",
    "# 选择用于衡量子集表现的有监督的机器学习模型\n",
    "from sklearn.ensemble import ExtraTreesRegressor # 使用ExtraTrees 模型作为示范\n",
    "clf = ExtraTreesRegressor(n_estimators=25)\n",
    "selector = RFE(estimator = clf, n_features_to_select = 4, step = 1) \n",
    "# 与RFECV不同，此处RFE函数需要用户定义选择的变量数量，此处设置为选择4个最好的变量，每一步我们仅删除一个变量\n",
    "\n",
    "selector = selector.fit(train_set, train_y) # 在训练集上训练\n",
    "\n",
    "transformed_train = train_set[:,selector.support_]  # 转换训练集\n",
    "assert np.array_equal(transformed_train, train_set[:,[0,5,6,7]]) # 选择了第一个，第六个，第七个及第八个变量\n",
    "\n",
    "transformed_test = test_set[:,selector.support_] # 转换训练集\n",
    "assert np.array_equal(transformed_test, test_set[:,[0,5,6,7]]) # 选择了第一个，第六个，第七个及第八个变量"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:44:14.363261Z",
     "start_time": "2020-03-30T03:43:56.696872Z"
    }
   },
   "outputs": [],
   "source": [
    "# RFECV 函数 演示\n",
    "import numpy as np\n",
    "from sklearn.feature_selection import RFECV\n",
    "\n",
    "# 直接载入数据集\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "dataset = fetch_california_housing()\n",
    "X, y = dataset.data, dataset.target # 利用 california_housing 数据集来演示\n",
    "\n",
    "# 选择前15000个观测点作为训练集\n",
    "# 剩下的作为测试集\n",
    "train_set = X[0:15000,:]\n",
    "test_set = X[15000:,]\n",
    "train_y = y[0:15000]\n",
    "\n",
    "# 选择用于衡量子集表现的有监督的机器学习模型\n",
    "from sklearn.ensemble import ExtraTreesRegressor # 使用ExtraTrees 模型作为示范\n",
    "clf = ExtraTreesRegressor(n_estimators=25)\n",
    "selector = RFECV(estimator = clf, step = 1, cv = 5) # 使用5折交叉验证\n",
    "# 每一步我们仅删除一个变量\n",
    "selector = selector.fit(train_set, train_y)\n",
    "\n",
    "transformed_train = train_set[:,selector.support_]  # 转换训练集\n",
    "assert np.array_equal(transformed_train, train_set) # 选择了所有的变量\n",
    "\n",
    "transformed_test = test_set[:,selector.support_] # 转换训练集\n",
    "assert np.array_equal(transformed_test, test_set) # 选择了所有的变量"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Randomized Algorithms 随机方法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "与确定性算法相比，随机方法在搜索最佳特征子集时引入了一定程度的随机性。因此，在相同数据输入的情形下，它可能会输出不同的最优特征子集结果，但此方法中的随机性将有助于避免模型陷入局部最优结果。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Simulated Annealing (SA) 基于模拟退火特征选择"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "模拟退火是一种随机最优化方法，近年来被引入到特征选择领域。在每一步中，我们将根据当前的最优特征子集随机选择一个特征子集。若新的特征子集效果更好，那么我们将采用它并更新当前最优特征子集。若新特征子集的表现不佳，我们仍会以一定的概率接受它，这个接受概率取决于当前的状态（温度）。\n",
    "\n",
    "以一定的概率接受变现不佳的特征子集对于模拟退火算法至关重要，因为这有助于算法避免陷入局部最优状态。随着迭代的进行，模拟退火算法可收敛为良好且稳定的最终结果。\n",
    "\n",
    "由于未发现能较好实现SA算法的函数，因此我编写了一个[python脚本](../SA.py)来实现SA算法，以供您参考。其能够很好地兼容sklearn中的模型，支持分类及回归问题。它还提供了内置交叉验证方法。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-28T15:49:02.243080Z",
     "start_time": "2020-03-28T15:49:02.236948Z"
    }
   },
   "source": [
    "公式:  \n",
    "  \n",
    "在每一步中，接受表现不佳的特征子集的概率为:  \n",
    "$Prob = exp( - \\frac{loss_{n} - loss_{o}}{k * Cur\\_{Temperature}})$  \n",
    "  \n",
    "Prob为接受表现不佳的特征子集的概率，$loss_n$为新特征子集的损失（loss），$loss_o$为新特征子集创建前的最优（最低）损失（loss），$Cur\\_{Temperature}$为当前的温度。\n",
    "  \n",
    "模拟退火的伪代码为:  \n",
    "\n",
    "![image](../images/基于模拟退火特征选择.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "回归问题演示"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:45:25.268299Z",
     "start_time": "2020-03-30T03:44:14.364883Z"
    }
   },
   "outputs": [],
   "source": [
    "import sys \n",
    "sys.path.append(\"..\") \n",
    "from SA import Simulated_Annealing # 导入我们撰写的模块\n",
    "\n",
    "# 直接载入数据集\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "dataset = fetch_california_housing()\n",
    "X, y = dataset.data, dataset.target # 利用 california_housing 数据集来演示\n",
    "\n",
    "# 选择前15000个观测点作为训练集\n",
    "# 剩下的作为测试集\n",
    "train_set = X[0:15000,:]\n",
    "test_set = X[15000:,]\n",
    "train_y = y[0:15000]\n",
    "\n",
    "# 选择用于衡量子集表现的有监督的机器学习模型\n",
    "from sklearn.ensemble import ExtraTreesRegressor # 使用ExtraTrees 模型作为示范\n",
    "\n",
    "# 选择模拟退火中评价特征子集的的损失函数\n",
    "from sklearn.metrics import mean_squared_error # 回归问题我们使用MSE\n",
    "\n",
    "clf = ExtraTreesRegressor(n_estimators=25)\n",
    "selector = Simulated_Annealing(loss_func = mean_squared_error, estimator = clf, \n",
    "                               init_temp = 0.2, min_temp = 0.005, iteration = 10, alpha = 0.9)\n",
    "# 在训练集中训练\n",
    "# SA.py中有具体每个参数的含义，此处不赘述\n",
    "\n",
    "selector.fit(X_train = train_set, y_train = train_y, cv = 5) # 使用5折交叉验证\n",
    "\n",
    "transformed_train = selector.transform(train_set) # 转换训练集\n",
    "transformed_test = selector.transform(test_set)  # 转换测试集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:45:25.272088Z",
     "start_time": "2020-03-30T03:45:25.269721Z"
    }
   },
   "outputs": [],
   "source": [
    "selector.best_sol # 返回最优特征的索引\n",
    "selector.best_loss; # 返回最优特征子集对应的损失"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "分类问题演示"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:45:25.567132Z",
     "start_time": "2020-03-30T03:45:25.273497Z"
    }
   },
   "outputs": [],
   "source": [
    "import sys \n",
    "sys.path.append(\"..\") \n",
    "import numpy as np\n",
    "import random\n",
    "from SA import Simulated_Annealing # 导入我们撰写的模块\n",
    "\n",
    "from sklearn.datasets import load_iris  # 利用iris数据作为演示数据集\n",
    "\n",
    "# 载入数据集\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "\n",
    "# iris 数据集使用前需要被打乱顺序\n",
    "np.random.seed(1234)\n",
    "idx = np.random.permutation(len(X))\n",
    "X = X[idx]\n",
    "y = y[idx]\n",
    "\n",
    "# 选择前100个观测点作为训练集\n",
    "# 剩下的前20个观测点作为验证集，剩下的30个观测作为测试集\n",
    "train_set = X[0:100,:]\n",
    "val_set = X[100:120,:]\n",
    "test_set = X[120:,:]\n",
    "\n",
    "train_y = y[0:100]\n",
    "val_y = y[100:120]\n",
    "test_y = y[120:]\n",
    "\n",
    "# 重制随机种子 \n",
    "# 随机方法需要随机性的存在\n",
    "random.seed()\n",
    "np.random.seed()\n",
    "\n",
    "# 选择用于衡量子集表现的有监督的机器学习模型\n",
    "from sklearn.ensemble import ExtraTreesClassifier # we use extratree as predictive model\n",
    "\n",
    "# 选择模拟退火中评价特征子集的的损失函数\n",
    "from sklearn.metrics import log_loss # 回归问题中，我们使用交叉熵损失函数\n",
    "\n",
    "clf = ExtraTreesClassifier(n_estimators=25)\n",
    "selector = Simulated_Annealing(loss_func = log_loss, estimator = clf, \n",
    "                               init_temp = 0.2, min_temp = 0.005, iteration = 10, \n",
    "                               alpha = 0.9, predict_type = 'predict_proba')\n",
    "# 在训练集中训练\n",
    "# SA.py中有具体每个参数的含义，此处不赘述\n",
    "\n",
    "selector.fit(X_train = train_set, y_train = train_y, X_val = val_set, \n",
    "             y_val = val_y, stop_point = 15) \n",
    "# 此函数允许用户导入自己定义的验证集，此处尝试一下\n",
    "\n",
    "transformed_train = selector.transform(train_set)  # 转换训练集\n",
    "transformed_test = selector.transform(test_set)  # 转换测试集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:45:25.571342Z",
     "start_time": "2020-03-30T03:45:25.568710Z"
    }
   },
   "outputs": [],
   "source": [
    "selector.best_sol # 返回最优特征的索引\n",
    "selector.best_loss; # 返回最优特征子集对应的损失"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Genetic Algorithm (GA) 基于基因算法特征选择"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "遗传算法是一种基于进化生物学概念的最优化搜索算法。它借鉴了自然界中的进化过程，并通过允许个体候选解通过“交叉”和“变异”来进化得到更优的候选解及种群。其还结合了自然界中的竞争理念，即仅允许最合适或最优的几个候选解“生存”下来并“繁殖”其后代。经过种群及个体候选解的持续迭代，基因算法（GA）会收敛到优化解决方案。\n",
    "\n",
    "与模拟退火类似，我也编写了一个[python脚本](../GA.py)来实现GA算法，以供您参考。它提供了两种算法，包括“one-max”和“ NSGA2”。 “one-max”为传统的单目标GA算法，“NSGA2”则为一个多目标GA算法。在特征选择中，“one-max”的目标是减少模拟在验证集上的损失，而“NSGA2”的目标一是减少损失，二是同时要最小化特征子集中特征的数量。\n",
    "  \n",
    "此python脚本能够很好地兼容sklearn中的模型，支持分类及回归问题。它还提供了内置交叉验证方法。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "基因算法的伪代码如下：  \n",
    "\n",
    "![image](../images/基于基因算法特征选择.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "回归问题演示"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:34.026251Z",
     "start_time": "2020-03-30T03:45:25.573202Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 10/10 [01:31<00:00,  9.11s/it]\n"
     ]
    }
   ],
   "source": [
    "import sys \n",
    "sys.path.append(\"..\") \n",
    "from GA import Genetic_Algorithm # 导入我们撰写的模块\n",
    "\n",
    "# 直接载入数据集\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "dataset = fetch_california_housing()\n",
    "X, y = dataset.data, dataset.target # 利用 california_housing 数据集来演示\n",
    "\n",
    "# 选择前15000个观测点作为训练集\n",
    "# 剩下的作为测试集\n",
    "train_set = X[0:15000,:]\n",
    "test_set = X[15000:,]\n",
    "train_y = y[0:15000]\n",
    "\n",
    "# 选择用于衡量子集表现的有监督的机器学习模型\n",
    "from sklearn.ensemble import ExtraTreesRegressor # 使用ExtraTrees 模型作为示范\n",
    "\n",
    "# 选择模拟退火中评价特征子集的的损失函数\n",
    "from sklearn.metrics import mean_squared_error # 回归问题我们使用MSE\n",
    "\n",
    "clf = ExtraTreesRegressor(n_estimators=25)\n",
    "selector = Genetic_Algorithm(loss_func = mean_squared_error, estimator = clf, \n",
    "                             n_gen = 10, n_pop = 20, algorithm = 'NSGA2')\n",
    "# 在训练集中训练\n",
    "# GA.py中有具体每个参数的含义，此处不赘述\n",
    "\n",
    "selector.fit(X_train = train_set, y_train = train_y, cv = 5) # 使用5折交叉验证\n",
    "\n",
    "transformed_train = selector.transform(train_set) # 转换训练集\n",
    "transformed_test = selector.transform(test_set)  # 转换测试集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:34.030792Z",
     "start_time": "2020-03-30T03:47:34.028121Z"
    }
   },
   "outputs": [],
   "source": [
    "selector.best_sol # 返回最优特征的索引\n",
    "selector.best_loss; # 返回最优特征子集对应的损失"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "分类问题演示"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:34.316372Z",
     "start_time": "2020-03-30T03:47:34.032578Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 15/15 [00:00<00:00, 128.87it/s]\n"
     ]
    }
   ],
   "source": [
    "import sys \n",
    "sys.path.append(\"..\") \n",
    "import numpy as np\n",
    "import random\n",
    "from GA import Genetic_Algorithm # 导入我们撰写的模块\n",
    "\n",
    "from sklearn.datasets import load_iris  # 利用iris数据作为演示数据集\n",
    "\n",
    "# 载入数据集\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "\n",
    "# iris 数据集使用前需要被打乱顺序\n",
    "np.random.seed(1234)\n",
    "idx = np.random.permutation(len(X))\n",
    "X = X[idx]\n",
    "y = y[idx]\n",
    "\n",
    "# 选择前100个观测点作为训练集\n",
    "# 剩下的前20个观测点作为验证集，剩下的30个观测作为测试集\n",
    "train_set = X[0:100,:]\n",
    "val_set = X[100:120,:]\n",
    "test_set = X[120:,:]\n",
    "\n",
    "train_y = y[0:100]\n",
    "val_y = y[100:120]\n",
    "test_y = y[120:]\n",
    "\n",
    "# 重制随机种子 \n",
    "# 随机方法需要随机性的存在\n",
    "random.seed()\n",
    "np.random.seed()\n",
    "\n",
    "# 选择用于衡量子集表现的有监督的机器学习模型\n",
    "from sklearn.ensemble import ExtraTreesClassifier # we use extratree as predictive model\n",
    "\n",
    "# 选择模拟退火中评价特征子集的的损失函数\n",
    "from sklearn.metrics import log_loss # 回归问题中，我们使用交叉熵损失函数\n",
    "\n",
    "clf = ExtraTreesClassifier(n_estimators=25)\n",
    "selector = Genetic_Algorithm(loss_func = log_loss, estimator = clf, \n",
    "                             n_gen = 15, n_pop = 10, predict_type = 'predict_proba')\n",
    "# 在训练集中训练\n",
    "# GA.py中有具体每个参数的含义，此处不赘述\n",
    "\n",
    "selector.fit(X_train = train_set, y_train = train_y, X_val = val_set, \n",
    "             y_val = val_y, stop_point = 15) \n",
    "# 此函数允许用户导入自己定义的验证集，此处尝试一下\n",
    "\n",
    "transformed_train = selector.transform(train_set)  # 转换训练集\n",
    "transformed_test = selector.transform(test_set)  # 转换测试集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:34.320140Z",
     "start_time": "2020-03-30T03:47:34.317774Z"
    }
   },
   "outputs": [],
   "source": [
    "selector.best_sol # 返回最优特征的索引\n",
    "selector.best_loss; # 返回最优特征子集对应的损失"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Embedded Methods 嵌入方法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "过滤法的特征选择过程与后续的机器学习模型无关，因此过滤法可能导致较差的模型性能。\n",
    "\n",
    "封装方法利用预定义的有监督的机器学习模型来选择最佳功能。但是，由于它们需要在大量可能的特征子集上多次训练模型，因此尽管它们通常会导致更好的性能，但它们同时也需要较长的处理时间。\n",
    "\n",
    "嵌入式方法将特征选择过程嵌入到机器学习模型中，即利用机器学习来为每一个特征打分。嵌入式方法在创建模型时即完成了对特征子集的选择。因此，与过滤法相比，它们往往具有更好的性能。与封装方法相比，它们节省了大量的处理时间和计算能力。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**三种方法的一个简单对比**.  \n",
    "   \n",
    "|方面 | 过滤法 | 封装法\t| 嵌入法\n",
    "|--|--|--|--|\n",
    "|是否需模型参与| 否 | 是 | 是 |\n",
    "|是否需要交叉验证 |\t可能(可利用交叉验证选择保留的特征数目) | 是 | 可能(可利用交叉验证选择保留的特征数目)\n",
    "|处理时间 |\t短 | 长 | 中等\n",
    "|对参与模型的限制|\t无 | 无 | 是 (嵌入法使用的模型为线性模型或树类模型)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![image](../images/嵌入法工作流.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 基于正则化模型的方法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "许多机器学习模型在其损失函数中引入了正则项（L1正则或L2正则），以防止过拟合问题。线性模型（例如线性向量支持机，逻辑回归，线性回归）中的L1正则项能够有效地将某些特征的特征系数缩小为零，从而实现解的稀疏。因此，基于带正则项线性模型的特征系数，我们可以为特征打分。系数越高，往往该特征在线性模型中越重要。\n",
    "\n",
    "我们可以使用sklearn SelectFromModel函数删除特征系数低或为零的特征。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Lasso Regression (Linear Regression with L1 Norm) 套索回归"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:34.517758Z",
     "start_time": "2020-03-30T03:47:34.321877Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0.346,  0.003, -0.   , -0.   , -0.   , -0.   , -0.033,  0.   ])"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "from sklearn.feature_selection import SelectFromModel\n",
    "from sklearn.linear_model import Lasso # 我们也可以使用带L2正则项的岭回归\n",
    "\n",
    "# 直接载入数据集\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "dataset = fetch_california_housing()\n",
    "X, y = dataset.data, dataset.target # 利用 california_housing 数据集来演示\n",
    "\n",
    "# 选择前15000个观测点作为训练集\n",
    "# 剩下的作为测试集\n",
    "train_set = X[0:15000,:]\n",
    "test_set = X[15000:,]\n",
    "train_y = y[0:15000]\n",
    "\n",
    "clf = Lasso(normalize=True, alpha = 0.001)  \n",
    "# 在进行线性回归前，我们需要先对变量进行缩放操作，否则回归系数大小无法比较\n",
    "# alpha控制正则效果的大小，alpha越大，正则效果越强\n",
    "\n",
    "clf.fit(train_set, train_y) # 在训练集上训练\n",
    "np.round(clf.coef_ ,3) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:34.524942Z",
     "start_time": "2020-03-30T03:47:34.519609Z"
    }
   },
   "outputs": [],
   "source": [
    "selector = SelectFromModel(clf, prefit=True, threshold=1e-5)\n",
    "# 阈值被设置为1e-5，因此绝对系数低于1e-5的特征将被删除\n",
    "# 我们还可以设置max_features参数以选择最重要的前几个特征\n",
    "\n",
    "transformed_train = selector.transform(train_set) # 转换训练集\n",
    "transformed_test = selector.transform(test_set) #转换测试集\n",
    "\n",
    "assert np.array_equal(transformed_train, train_set[:,[0,1,6]]) \n",
    "# 选择第一个，第二个及第七个变量\n",
    "assert np.array_equal(transformed_test, test_set[:,[0,1,6]]) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Logistic Regression (with L1 Norm) 逻辑回归"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:34.538253Z",
     "start_time": "2020-03-30T03:47:34.526597Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.   ,  1.   , -3.452, -0.159],\n",
       "       [ 0.   , -1.201,  0.053,  0.   ],\n",
       "       [ 0.   ,  0.   ,  1.331,  3.27 ]])"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "from sklearn.feature_selection import SelectFromModel\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.datasets import load_iris  # 利用iris数据作为演示数据集\n",
    "\n",
    "# 载入数据集\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "\n",
    "# iris 数据集使用前需要被打乱顺序\n",
    "np.random.seed(1234)\n",
    "idx = np.random.permutation(len(X))\n",
    "X = X[idx]\n",
    "y = y[idx]\n",
    "\n",
    "# 选择前100个观测点作为训练集\n",
    "# 剩下的50个观测点作为测试集\n",
    "train_set = X[0:100,:]\n",
    "test_set = X[100:,]\n",
    "train_y = y[0:100]\n",
    "\n",
    "# 在进行逻辑回归前，我们需要先对变量进行缩放操作，否则回归系数大小无法比较\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "model = StandardScaler()\n",
    "model.fit(train_set) \n",
    "standardized_train = model.transform(train_set)\n",
    "standardized_test = model.transform(test_set)\n",
    "\n",
    "clf = LogisticRegression(penalty='l1', C = 0.7, \n",
    "                         random_state=1234, solver='liblinear') \n",
    "# 我们也可以将正则项设置为 'l2'\n",
    "# C控制正则效果的大小，C越大，正则效果越弱\n",
    "\n",
    "clf.fit(standardized_train, train_y)\n",
    "np.round(clf.coef_,3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:34.544196Z",
     "start_time": "2020-03-30T03:47:34.539789Z"
    }
   },
   "outputs": [],
   "source": [
    "selector = SelectFromModel(clf, prefit=True, threshold=1e-5)\n",
    "# 阈值被设置为1e-5，因此绝对系数低于1e-5的特征将被删除\n",
    "# 我们还可以设置max_features参数以选择最重要的前几个特征\n",
    "\n",
    "transformed_train = selector.transform(train_set) # 转换训练集\n",
    "transformed_test = selector.transform(test_set) #转换测试集\n",
    "\n",
    "assert np.array_equal(transformed_train, train_set[:,[1,2,3]]) \n",
    "# 选择第2个, 第3个及第4个变量\n",
    "assert np.array_equal(transformed_test, test_set[:,[1,2,3]]) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### LinearSVR/ LinearSVC 线性向量支持机"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:34.568504Z",
     "start_time": "2020-03-30T03:47:34.545714Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0.254,  0.026,  0.026, -0.017,  0.032, -0.01 , -0.1  , -0.037])"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# LinearSVC 用于分类问题\n",
    "# LinearSVR 用于回归问题\n",
    "# 这里以LinearSVR为例\n",
    "\n",
    "import numpy as np\n",
    "from sklearn.feature_selection import SelectFromModel\n",
    "from sklearn.svm import LinearSVR\n",
    "\n",
    "# 直接载入数据集\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "dataset = fetch_california_housing()\n",
    "X, y = dataset.data, dataset.target # 利用 california_housing 数据集来演示\n",
    "\n",
    "# 选择前15000个观测点作为训练集\n",
    "# 剩下的作为测试集\n",
    "train_set = X[0:15000,:]\n",
    "test_set = X[15000:,]\n",
    "train_y = y[0:15000]\n",
    "\n",
    "# 在进行逻辑回归前，我们需要先对变量进行缩放操作，否则回归系数大小无法比较\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "model = StandardScaler()\n",
    "model.fit(train_set) \n",
    "standardized_train = model.transform(train_set)\n",
    "standardized_test = model.transform(test_set)\n",
    "\n",
    "clf = LinearSVR(C = 0.0001, random_state = 123) \n",
    "# C控制正则效果的大小，C越大，正则效果越弱\n",
    "\n",
    "clf.fit(standardized_train, train_y)\n",
    "np.round(clf.coef_,3) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:34.576429Z",
     "start_time": "2020-03-30T03:47:34.570070Z"
    }
   },
   "outputs": [],
   "source": [
    "selector = SelectFromModel(clf, prefit=True, threshold=1e-2)\n",
    "# 阈值被设置为1e-2，因此绝对系数低于1e-2的特征将被删除\n",
    "# 我们还可以设置max_features参数以选择最重要的前几个特征\n",
    "\n",
    "transformed_train = selector.transform(train_set) # 转换训练集\n",
    "transformed_test = selector.transform(test_set) #转换测试集\n",
    "\n",
    "assert np.array_equal(transformed_train, train_set[:,[0,1,2,3,4,6,7]]) \n",
    "# 仅第6个变量被删去\n",
    "assert np.array_equal(transformed_test, test_set[:,[0,1,2,3,4,6,7]]) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tree Based Methods 基于树模型的方法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "机器学习的一大分支便是基于树的机器学习模型，例如随机森林，AdaBoost，Xgboost等。您可以在我的朋友和我撰写的一系列博客中找到有关这些基于树的机器学习模型的更多介绍[此处](https://github.com/YC-Coder-Chen/Tree-Math)。\n",
    "\n",
    "这些非参的树状模型在建立的过程中记录了每一个变量如何在树节点的分叉中逐步降低模型损失，并可以根据上述记录分析每个特征的特征重要性。而我们可以基于这特征重要性删去一些不重要的变量。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:38.272357Z",
     "start_time": "2020-03-30T03:47:34.577834Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.52 , 0.045, 0.031, 0.026, 0.027, 0.139, 0.106, 0.107])"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 我们以随机森林为例\n",
    "import numpy as np\n",
    "from sklearn.feature_selection import SelectFromModel\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "\n",
    "# 直接载入数据集\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "dataset = fetch_california_housing()\n",
    "X, y = dataset.data, dataset.target # 利用 california_housing 数据集来演示\n",
    "\n",
    "# 选择前15000个观测点作为训练集\n",
    "# 剩下的作为测试集\n",
    "train_set = X[0:15000,:]\n",
    "test_set = X[15000:,]\n",
    "train_y = y[0:15000]\n",
    "\n",
    "# 在树类机器学习模型中，我们无需缩放变量操作\n",
    "\n",
    "clf = RandomForestRegressor(n_estimators = 50, random_state = 123)\n",
    "\n",
    "clf.fit(train_set, train_y)\n",
    "np.round(clf.feature_importances_, 3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:38.541659Z",
     "start_time": "2020-03-30T03:47:38.274508Z"
    }
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAugAAALGCAYAAAAX/GTWAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3X28pmVd7/3vjxlAR0JFlHxgRySKWIA2KLQFJ42nbJePRdkDmZFFhLt7LDH0ZpepFfuOMm2Hdoe4uxvDSA0znyc08WEmG0G3lHEn1i6stmwDFRj57T+uc3I5DsyamTXrOhbzfr9e81rXda7jPK/jmvM1w2cOznVe1d0BAADGsN+8JwAAAHyFQAcAgIEIdAAAGIhABwCAgQh0gH1QVdW857BQVT2gqlbPex4AIxDoAPcQVXVyVV213bbzquolOxj+vKr6nSV4zftX1QMWPN+vqg5YxH4HbPePhHck+abtxry4ql64p3MEWGmsVgDcc3wpyeeTZFqNvjPJF5IcNMXwvbr7i1X14CQvTvLdVfWCJD+Z5PYk90pyR5JOcnCSx3b3/6yqhyY5N8m9k3xdkgckOTTJQZkt9FyZ5L9Mczg+yauq6rbp+XFJ/mZ6fGSS66bHByY5K8mnp+e3TXNIVf1ykmumbVuX5HcGYAWxgg5wD1BV78ps0eXOadN3J9mU5MLM4vojSf6kqg5J8idJXpFkc3f/Wncf2d1HJ/lUkh/v7kd099dPcb5fkn9N8u4klyf5pcyC+4+6+zHdfVx3/5eqWl1V+3X3Xyb5uczi/aAkq5KsmX6tWrD9Bd396Zo5cJr3U6rqfkm+NclN07Y7F4wB2CdYQQdY4arqyMxWqs9L8tiq+qXufnGSK6vq7CRf392vmMaemuSvk7wryQer6pTuvu0uDp0kj07y/2a2mr3Nf0iytaqeuWDb6iQ/m+QDma20X9fdZ0//cPjpacwruvupVXVZZsG+7VivnR5/c5KzkzwmyW8nOSyz1fwfSvKZJE9d9G8KwAom0AFWvicl+dUk703y/CQvqao3JHlEZqvWVVU/m+Qfk7wuybOTvD3JezIL7Qd39z/u6MDdfW2SE6rqSUmOmTZ/Z2aXzmycnn+qu/9swW77J3lMVb00s+vKz5+2P2ra9pgkfzhtOzrJ/0jy2CSvT/LcJJu6+ylVtT7J1u6+ZPd+WwBWJoEOsPK9K7Prtx+SpJJ8a3d/38IBVfXJ7j5uenx+klOTnDHtc02Sw6ehv1NVX0jy5e5eu+AQz8rs+vaPJLklya1J/imzFfZvS7Iw0K/O7B8Bmcb/YpIfTvKqaduGfOXa8wcleX9mgf4PSd4yvQ7APkugA6x8n84scH8qs9Xtz1TV06fH2xxRVR9L8l+TPC/JF7r7zqr6Ur768pWf6O6NO3iNTvKMJE/OLOq3JvmWJPfNLMKTJFX1smnMHZn9Y+Hh0/FfldnKemX2j4mDqmrDgktvfiZJuvuPq+rlVfVfM/th1K6qH0lybXf/8G7+/gCsKAIdYOU7ObMf3nxvkrd29wunWx1u6O6PJbMV9MzusLI2yffnK5enLNYnMwvrm5Osy+yOMR/MLNA/u21Qd79oer1KclmSD3f3f66qNZmtrD+3u//q7l6ouy9IcoFLXIB9lUAHWOG6++okT6yqtZldg57MVrj/oKqe3t3XT+PuTPLhJNmVzymqqkdkdkvEW5N8OcnDMlshT2Yr4idU1ebufseC3R6c5MYkj6uqv0hyQJLNSa7drTcJsA8R6AD3HDX9ynT5ys9kdonI6nztbXVXV9W9Mvsh0q890GwF/s7u3trdf53ZdebbvndRkpt3tLI97fegJPdJ8peZ3dHl5Mx+OPVbkmypqk2ZXWv+1gV3kFmdZFVV7T+bfm/dwXG3Tv/IALhHE+gA9xwHTr+SJN397qo6Ksk/Z3arxIX2z+y2iM9McvMUzUly8XR5ygFJLqqqa5L8aZJ/y2z1PJmuQa+qbbc9rMw+wOi5mX0Q0eWZrbB/JLMQ/7ltYV1Vq5I8Jcnjk7x5u/kckOQ5SZ5TVdtW6DPdzvHAzK6d37zrvy0AK0t197znAMBeVFXV/rIHWDEEOgAADGT7axIBAIA5EugAADAQgQ4AAAPZ5+/icuihh/YRRxwx72kAAHAPt3nz5n/p7gfubNw+H+hHHHFENm3atPOBAACwB6rq04sZ5xIXAAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIGsnvcE5u0TN92Q4y4+a97TAABgL9uyfsO8p7AoVtABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAay7IFeVZdV1R9OjzdU1WU7GX9RVa1b8PxTe3eGAAAwP/NaQT9uu68AAECS1XN63dur6gFJ7kiypqremORBSa7t7nOr6v5JrkiyKkkl2bijg1TVRUn2T3JykoOTnJHk5iSXJXnY9Ph7u/sLe/PNAADAUpnXCvqWJN83fT0tyXXdfUqSB1fVsUnOSXJVd397ZhF/dx4+7XtlkidN+27p7ick+aMk37z9DlV1TlVtqqpNW2+5bcneFAAA7Kl5BfpfJjl7+npNkqdV1cYkRyZ5aJJvzCzek2TTTo51+fT1xiQHJDk6yYenbZcl+cj2O3T3pd29trvXrj7owN1+EwAAsNTmGegnTF9PSnJJd69LcmFmoX1jkkdPY4/fybFu3e75J6djJ8mLkjx3CeYLAADLYl6B/ndJ/jrJp5O8M8mZVXV1kucl+UySS5M8Y1pVP3gXj/2aJI+d9n1sktcvzZQBAGDvq+6e9xzmas3hh/RR558272kAALCXbVm/Ya6vX1Wbu3vtzsb5oCIAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIGsnvcE5u2Yw47MpvUb5j0NAABIYgUdAACGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgq+c9gXn7xE035LiLz5r3NOBubVm/Yd5TAACWiRV0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBLFmgV9UhVfVvVXWvXdjnoKr646p6f1W9rqpWL9V8AABgJVrKFfRTk9wrySm7sM95Sf6mu5+Q5MAk37uE8wEAgBVnKVesz0jyqiRnVNUJST7e3W+qqguSfCrJW5NcnuRBSa7t7nOTPD7Ja6f935/khKq6MsllSR6W5ObMov3OHWz7uSQbu3tjVZ09HeOI6ZhrkvxzkrO6e+sSvkcAANirlnIF/aQkL03y5CRXJDlz2n5Kkj9Nck6S67r7lCQPrqpjk3xdkluncV9IcvA0bsu0qv5HSb75Lrbdlfd19xOT3JTke3Y0oKrOqapNVbVp6y237e77BQCAJbckgT7F9qFJ3pjZKvYXkzysqg5OcnN335rkkUmeVlUbkxyZ5KFJPp/koOkw95meH53kw9O2y5J85C62LXTvBY83T18/Ns3la3T3pd29trvXrj7owF16rwAAsDct1Qr66Ule1t3rkvzm9PzDSZ6f5C3TmOuTXDKNuTDJjUk+lGTd9P2Tp30+meSEaduLkjz3LrbdnuSB07YzFszlcdPXx2R2aQ0AAKwYS3UN+ulJ1k+P35Pk3CQvyey68m+Ytr8mye9V1Y9mtlL+A0l+K8nrq+oDSf4ms0tj9k/yumml/V+TPDtJ7WDbw5O8uqqePG3b5oRp3D8luWqJ3h8AACyL6u55z2HJVNVFmX5wdLH7rDn8kD7q/NP22pxgKWxZv2HeUwAA9lBVbe7utTsbd4+673h3XzTvOQAAwJ7wSaIAADAQgQ4AAAMR6AAAMBCBDgAAAxHoAAAwEIEOAAADEegAADAQgQ4AAAMR6AAAMBCBDgAAAxHoAAAwEIEOAAADEegAADAQgQ4AAAMR6AAAMBCBDgAAAxHoAAAwEIEOAAADEegAADAQgQ4AAAMR6AAAMBCBDgAAAxHoAAAwEIEOAAADEegAADCQ1fOewLwdc9iR2bR+w7ynAQAASaygAwDAUAQ6AAAMRKADAMBABDoAAAxEoAMAwEAEOgAADESgAwDAQAQ6AAAMRKADAMBABDoAAAxEoAMAwEAEOgAADESgAwDAQAQ6AAAMZPW8JzBvn7jphhx38VnzngYD2LJ+w7ynAABgBR0AAEYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCC7FehVdVFV/eBSTqSqLtnu+fFVdfwuzmndUs4JAACW2zAr6N39/O02HT/9AgCAfcbqpThIVR2Y5LIkD0ny90l+NMmLkuyf5OQkByc5I8n/TnJlkkOS/G2S67r7ZdMxNnb3uunxy5M8bXr8Q9395Kq6KMnG7t5YVWdPL/3mJFckWZWkkmysqjVJLk/yoCTXdve5S/EeAQBgOSzVCvqPZxbbT0zyN0meM21/eHefklmUPynJ0ZkF/BOm771sRwfr7guSvCLJK7r7yXfzuuckuaq7vz3JHQu2XTe97oOr6tjtd6qqc6pqU1Vt2nrLbbv6XgEAYK9ZqkA/JsmHpscfTPKo6fHl09cbkxyQ5B+SfGuSq5P8xh683r2nr9+YZMv0eNP09ZFJnlZVG5McmeSh2+/c3Zd299ruXrv6oAP3YBoAALC0lirQP57kxOnxidPzJLl1u3FnJPml7j6pu39/J8f8YpI1SVJVleT2JA9ccJxkFv6Pnh5vu179+iSXTJfLXDiNAQCAFWFPAv0Xt10mktm17I+uqquTHJXZ9eg78tEkr6yq91TVhqr65rs5/juTPL2q/iKz69jfkuS8qvpvSf51GnNpkmdMq+UHT9tek+TMaS7PS/KZ3X6HAACwzKq7l+/Fqn48yfdndr34HUku7u6NyzaBHVhz+CF91PmnzXMKDGLL+g3zngIAcA9WVZu7e+3Oxi3JXVwWq7tfk9kKNwAAsAPD3AcdAAAQ6AAAMBSBDgAAAxHoAAAwEIEOAAADEegAADAQgQ4AAAMR6AAAMBCBDgAAAxHoAAAwEIEOAAADEegAADAQgQ4AAAMR6AAAMBCBDgAAAxHoAAAwEIEOAAADEegAADAQgQ4AAAMR6AAAMBCBDgAAAxHoAAAwEIEOAAADEegAADAQgQ4AAANZPe8JzNsxhx2ZTes3zHsaAACQxAo6AAAMRaADAMBABDoAAAxEoAMAwEAEOgAADESgAwDAQAQ6AAAMRKADAMBABDoAAAxEoAMAwEAEOgAADESgAwDAQAQ6AAAMRKADAMBAVs97AvP2iZtuyHEXnzXvaTBnW9ZvmPcUAACSWEEHAIChCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABjIkgV6VV1UVT+4iHHHV9XxO9h+yWLG7eT11y12PAAAjGgeK+jHT7++Snc/fzHjAADgnmz13jpwVR2U5I1J7pPkU939o1X18iRPm77/Q9395AXjN3b3uunx14yrqouSbOzujVV19rTbm5NckWRVkkqysarWJLk8yYOSXNvd5+6t9wgAAEttrwV6kgcneWWSdyX5s6o6rLsvqKrrk6S7L7urHRc7Lsk5Sa7q7kuq6p0Ltl3X3RdV1ZVVdWx3f2zhTlV1zjQu+99vze69OwAA2Av25iUudyR5bpLfT3JIknsv4bG3Hesbk2yZHm+avj4yydOqamOSI5M8dPudu/vS7l7b3WtXH3TgEk4LAAD2zN4M9B/L7BKX709y64LtX0yyJkmqqu5m/+3H3Z7kgdP3zpi+3pjk0dPjbderX5/kkulymQunMQAAsCIsdaD/YlVtqqpNma2gX5DkPdP3tq1kvzPJ06vqL5KcfDfH2n7cW5KcV1X/Lcm/TmMuTfKMabX84Gnba5KcWVVXJ3leks8syTsDAIBlUN097znM1ZrDD+mjzj9t3tNgzras3zDvKQAA93BVtbm71+5snA8qAgCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYyOp5T2DejjnsyGxav2He0wAAgCRW0AEAYCgCHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABrJ63hOYt0/cdEOOu/iseU9jr9myfsO8pwAAwC6wgg4AAAMR6AAAMBCBDgAAAxHoAAAwEIEOAAADEegAADAQgQ4AAAMR6AAAMBCBDgAAAxHoAAAwEIEOAAADEegAADAQgQ4AAAMR6AAAMBCBDgAAAxHoAAAwEIEOAAADEegAADAQgQ4AAAMR6AAAMBCBDgAAAxHoAAAwEIEOAAADEegAADAQgQ4AAAMR6AAAMJAlC/SquqyqnjA9vrCqzl6qYy94jeOr6v9f6uMCAMAoVtoK+ulJHlZVj5j3RAAAYG/Ym4F+SFX9SVW9r6ouSf59lf2I6fFFVbWuqh5UVe+tqvdX1e9M3zusqt5WVR+oqgsWHPP0JK9KcsaCcVdX1Yer6vVV9dy72RcAAIa31IH+yqramOTHkrwoyYbuPjnJfavqjLvY5+Qk13b3E5JcXVX7JbkgyRu6+9uSPLWqHlBVByU5JMlrMwv1JPm2JG9P8rQk9+vu1+5o3+1fsKrOqapNVbVp6y23LdFbBwCAPbfUgX5ed69L8rtJ1iT50LT9Q0ketd3Ye09f35ZkVVW9M8nR3X1nkkcm+ckp9u+T5CFJnpTk0CS/leSkqjowyQ1JnpnkDUl+Yzrejvb9Kt19aXev7e61qw86cE/fMwAALJm9eYnLrUlOnB6fmOTjSW5P8sCqWpXk1Ol7JyV5fXefmuRJVfVNSa5P8sIp9l+R5H9ltmr+M9O2t2a28v49SZ7T3U/o7ndNx9vRvgAAsCLszUB/eZKzqur9SW7u7nck2ZDkV5L8dpJPTeNuSPKrVXVNks8m+XRmYb2+qv4is+vNb8os6DdO+7xn2r45yR9V1bur6vKqeuhd7AsAACtCdfe857DbquqiJP8xyZeTbE3y89398V05xprDD+mjzj9tL8xuDFvWb5j3FAAASFJVm7t77c7GrV6Oyewt3X3RvOcAAABLaaXdBx0AAO7RBDoAAAxEoAMAwEAEOgAADESgAwDAQAQ6AAAMRKADAMBABDoAAAxEoAMAwEAEOgAADESgAwDAQAQ6AAAMRKADAMBABDoAAAxEoAMAwEAEOgAADESgAwDAQAQ6AAAMRKADAMBABDoAAAxEoAMAwEAEOgAADESgAwDAQAQ6AAAMRKADAMBAVs97AvN2zGFHZtP6DfOeBgAAJLGCDgAAQxHoAAAwEIEOAAADEegAADAQgQ4AAAMR6AAAMBCBDgAAAxHoAAAwEIEOAAADEegAADAQgQ4AAAMR6AAAMBCBDgAAAxHoAAAwkNXznsC8feKmG3LcxWfNexpfY8v6DfOeAgAAc2AFHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIEsa6FV1SFX9W1Xdaxf2uayqPlpV11TVFVW1/1LOCQAAVpKlXkE/Ncm9kpyyi/ud190nJbklyXcs8ZwAAGDFWOpAPyPJq5KcUVW/UFVPTZKquqCqnlVVa6rqjVV1dVW9auGOVVVJDkpy+7QS/ydV9b6qumT6/o623VBV76+qN1TVX1XVd1bV0dO2D1bVi5f4/QEAwF611IF+UpKXJnlykiuSnDltPyXJnyY5J8l13X1KkgdX1bHT91+Z5O+S3JTkPUlelGRDd5+c5L5VdcZdbKskP5LksCT/V5ITkjwlyZXdfeJ0TAAAWDGWLNCn2D40yRuTHJHki0keVlUHJ7m5u29N8sgkT6uqjUmOTPLQaffzkvx2kr/t7k5yTJIPTd/7UJJH3cW2Tyf5cmYh/uXMgv31SY6tqncmue9dzPWcqtpUVZu23nLbUrx9AABYEku5gn56kpd197okvzk9/3CS5yd5yzTm+iSXTGMuTHLjgv1/J8mPVdWqJB9PcuK0/cTp+Y627ci6JL+c2eU2L9jRD51296Xdvba7164+6MBdfqMAALC3LHWgv2d6/J7MAvmKzAL9qmn7a5KcWVVXJ3leks9s27m7Pzft94wkL09yVlW9P7PV93fcxbYduSHJ65J8IMnbuvuOpXuLAACwd9XsipJ915rDD+mjzj9t3tP4GlvWb5j3FAAAWEJVtbm71+5snA8qAgCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYyOp5T2DejjnsyGxav2He0wAAgCRW0AEAYCgCHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABrJ63hOYt0/cdEOOu/isuc5hy/oNc319AADGYQUdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgOw30qrqoqv5HVV1dVe+uqofs6YtW1cZFjDmiqtbtYPsle/r6AAAwqsWuoP9yd5+S5PeSnLcX57PQEUnWbb+xu5+/TK8PAADLbvUujr9/ki9V1R8keUiSv0/yo0lelOTxSdYk+eckZyX5wSTp7sumlfB13X3R9gesqlVJXp/kG5L8S5JnJvmp6bj3m/Z9Vnf/8zR+Y3evW7D/K5Mcn+TmJD+c5HuSHDf9+vok39vd1+3i+wQAgLlY7Ar6L1TV1UlOzCyEr+vuJyb5myTPmca8b9p2U2aRvFgPSPLWJE9M8vkkj+3u30jy/CSXdfe6bXG+var6riT36u6Tk/xRkp+fvnVCktOTvCLJd+/CXAAAYK526RKX7n52kkcm+dC0/YNJHjU93jx9/Vhml6csdO+7OfYdSb4ryRVJjtzJ2O0dcxdz+YPuviPJjUkO2H6nqjqnqjZV1aatt9y2Cy8HAAB71+7cxeXjma2kZ/r68enx46avj0nyqSS3J3ngtO3Muzne05NcN339hwXbv5jZJTOpqtrFudx6d2+guy/t7rXdvXb1QQfe3VAAAFhWuxPor03y6OmSl6OSXDZtP2G6O8v9klyV5D1J/lNVvSrJqrs53l8k+b4k709ySJKHTts/muSRVfW+6ftfo7vfmuSLVfX+JM9I8mu78X4AAGAY1d17fpCqi5Js7O6Ne3ywZbbm8EP6qPNPm+sctqzfMNfXBwBg76uqzd29dmfjdvUuLju0o7uzAAAAu84niQIAwEAEOgAADESgAwDAQAQ6AAAMRKADAMBABDoAAAxEoAMAwEAEOgAADESgAwDAQAQ6AAAMRKADAMBABDoAAAxEoAMAwEAEOgAADESgAwDAQAQ6AAAMRKADAMBABDoAAAxEoAMAwEAEOgAADESgAwDAQAQ6AAAMRKADAMBABDoAAAxEoAMAwEBWz3sC83bMYUdm0/oN854GAAAksYIOAABDEegAADAQgQ4AAAMR6AAAMBCBDgAAAxHoAAAwEIEOAAADEegAADAQgQ4AAAMR6AAAMBCBDgAAAxHoAAAwEIEOAAADEegAADCQ1fOewLx94qYbctzFZ83t9bes3zC31wYAYDxW0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCCn2A8bAAAM8klEQVQCHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABrLbgV5Vh1TVv1XVvXZhn8uq6qNVtbGq3lBVqxaxz0VVtW535wkAACvJnqygn5rkXklO2cX9zuvudUk+l+S0PXh9AAC4x1m9B/uekeRVSc6oqhOSfLy731RVFyT5VJK3Jrk8yYOSXNvd5263/6FJbq2qNduPq6r7J7kiyaoklWRjklTVxiQfSXJsd59eVZuTfDbJ7Um+PsnvJbkyyRuS7D/N6Sf24D0CAMCy2pMV9JOSvDTJkzOL6TOn7ack+dMk5yS5rrtPSfLgqjp2+v4rq+qTSR6S5Jq7GHdOkqu6+9uT3LHgNU9Mck13nz49X5PkWUmOTfIDSR6f5OTMQv8JSa6uKtfZAwCwYuxWvE4RfWiSNyY5IskXkzysqg5OcnN335rkkUmeNq16H5nkodPu5yU5JrOV8BfexbhvTLJlGr9pwUtf191XLnh+U3ffkuTTSb6c2Wr725Ksqqp3Jjm6u+/cwfzPqapNVbVp6y237c5vAQAA7BW7u7p8epKXTdeS/+b0/MNJnp/kLdOY65NcMo25MMmN23aeovlzSb7uLsbdmOTR0/DjF7zuLYuY20lJXt/dpyZ5UlV90/YDuvvS7l7b3WtXH3TgYt4vAAAsi929Bv30JOunx+9Jcm6SlyR5f5JvmLa/JsnvVdWPJvl8ZpegJLNLXL4wPf6BJP+yg3GXJrmiqp6Z2bXku+KGJK+rqv2T/FNmq+sAALAiVHfPew5ztebwQ/qo8+d3M5kt6zfM7bUBAFg+VbW5u9fubJwfoAQAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGIhABwCAgQh0AAAYiEAHAICBCHQAABiIQAcAgIEIdAAAGMjqeU9g3o457MhsWr9h3tMAAIAkVtABAGAoAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgIAIdAAAGItABAGAgAh0AAAYi0AEAYCACHQAABiLQAQBgINXd857DXFXVvyW5ft7zYFkcmuRf5j0JloVzvW9wnvcdzvW+YV84z9/Q3Q/c2aDVyzGTwV3f3WvnPQn2vqra5FzvG5zrfYPzvO9wrvcNzvNXuMQFAAAGItABAGAgAj25dN4TYNk41/sO53rf4DzvO5zrfYPzPNnnf0gUAABGYgUdgBWvqg6pqlOr6tB5zwVgT+0zgV5Vv1tV11TVhXsyhvEt8lwfVlXvW855sbR2dp6r6r5V9baqekdV/XFVHbDcc2RpLOJc3z/JVUkel+S9VbXTW5gxpsX+d3j6O/yjyzUvltYi/kyvrqobq2rj9OtblnuO87ZPBHpVPT3Jqu4+KcmRVXXU7oxhfIs81/dP8rok91nu+bE0Fvnn9dlJ/p/uPi3JPyU5YznnyNJY5Lk+NsnPdvcvJ3l7kscu5xxZGrv43+GLk9x7eWbGUtqFP9N/0N3rpl/XLu8s52+fCPQk65L84fT4HUmesJtjGN+67Pw8fjnJ9yX5/DLNiaW3Ljs5z9396u5+5/T0gUk+uzxTY4mty87P9Z939wer6pTMVtGvWb7psYTWZRH/Ha6qJyW5NbN/eLPyrMvOz/OJSb6rqj48rbbvc5/bs68E+n2S/MP0+H8lOWw3xzC+nZ7H7v58d//vZZ0VS23Rf16r6qQk9+/uDy7HxFhyizrXVVWZ/cP7c0nuWJ6pscR2eq6nS9VenOSFyzgvltZi/kx/JMl3dPfjkuyf5DuXaW7D2FcC/ZZ85X+FHZQdv+/FjGF8zuO+YVHnuaoOSfLKJM9Zpnmx9BZ1rnvm3CQfS/LdyzQ3ltZizvULk7y6u29etlmx1BZznj/W3f84Pd6UZJ+77HhfiZfN+cr/Qjkuyd/t5hjG5zzuG3Z6nqeVtiuSXNDdn16+qbHEFnOuf76qfnh6er8k4m1lWszf39+R5Nyq2pjk+Kp67fJMjSW0mPP8+qo6rqpWJXlqki3LNLdh7BP3Qa+qg5O8L8m7k5yZ5Kwkz+ruC+9mzIkug1h5FnOuF4zd2N3rlneGLIVF/pn+ySQvy1f+Yv/t7n7Dcs+VPbPIc33/zK5pPTDJdUnO7X3hP273MLvy9/c03t/hK9Ai/0x/c5L/L0kleUt3/8I85jpP+0SgJ//+F/ipSa7u7h3+YMlixjA+53Hf4DzvO5zrfYdzvW9wnndunwl0AABYCfaVa9ABAGBFEOgAADAQgQ7AsqmqVXf3oSPTR3yvWjD265ZvdgBjEOgAK0xVvb2q1lTVfjXzjilmt4VtVdWbq+ph0+0mF+777wE8Pf/Y9PW0qvq7qto4/frHqjp6J/P4tar67gXPH15VZ1bV06vqx6vqxVV1WVX9aVXdZxq2Lsmbq+pNVbWlqj5UVR+YHr8pyZuTPH4a+41JLp2O/ZCqOriq3jXdBQLgHmuf++hUgJWqqh6e5AeS3NHdX6iqM5O8IMm3JHl7klVVdXqSFyX5aGYfBvLeqrojydrMPvBjVZJfrap3J/m6JFur6kFJDkhyWXdfNL3Wa5PcMb3mS5Lcntktz17d3ZunKd06bU9V7Tcd436ZfTrg45P8a5Lnd/fN0z8m9uvud2d2e7VU1U8n+fskX0ry8O7+re3e8pemcQ9J8jNJ3jptu32PfzMBBuYuLgArRFU9PcnzkzwssyD/me6+o6qu6u7vmsZ8T5JnJ/ntJDd2999O2/+qu49fcKynJvm/kxyb2b3ir0nyz9sF+kuTfDqz/1bcWVW/ktn/eT02yR1JHpHk80n+e2b3H1+fZOv0Ev8hyZ2ZBXgyWxD69e5+e1X9epJvS3LfJLdN4+6d5HNJtnT386rqpMw+EfRbMvvgoUMy+8jv45L8VWYfF/7t3S3WgXscK+gAK8e3JXlekl/t7p+sql+tqrVJUlWfyixw/zDJT2W2Sv20qjqpu6/ZdoDp8pY7u/tNVXVDkg8k+dYk/zHJ5VW17RP+HpVZoB+Q5L9X1Y8keXSS70ly7+6+paouSvLB7v6z6djvTbIhyfWZhfzWJB/LLOSf3d13TMfemuQ3Movvz2YW6YdntsL/5GnM0UluSvLgJH+Z5CndfWpVXZXkmd39pT3/7QQYk0AHWDl+JbPLRlJV35vkxd192/T8oswuAfnLJH+Q5H3dfUNVXZnk+GnMuzL7e/+8JNcm+f7MYvlPkvx+ksu3W0FPd99WVa9OckWSLyR5QJLfTfKftp9cd3+5qr49yf2TPDSzlfE1SR6zIM6T5PIk3zTN4UWZreT/z+l7l05fb8/sI8GPT/KWJH9dVbXtAFV1gNVz4J5KoAOsHA9KckGSJyR5Z5Lzq+opSTrJEZlF86X56mu0/32lubu/Y9vjqnpgkhOSfDLJi5N8Z5Kzq2rdNOSRma2gp7vfW1VnJ7m1uz87/QDpydtPbgrozyW5KrPV/q1JPpzk4VW1/3Q5zkuTPC6zeP/WJH+b5MczW53/5HSYq7v7F6vqYdPr/900r7fnK8G+dZozwD2OQAdYOfbL7C4nj+ju35ju0FLd/SvTCvpVma1Kb05y9k6OdUySC5Nc0t0fne6ysv/2K+jT42OTHJRk9RTmL8ns2vMnb3fM/ZP8dJIvZ/YPhtszuzb9JzL74dQ7uvvC6ZjPy+w69Rdkdj35s7r7x+5qst19WZLLXOIC7AsEOsAK0d3XJklVvWB6fntVff2CWxhuuySlZsO++haL076rkuzX3X8+Pd922ch++doV9JdPcf5bSZ45jTmvu9837btfZqv323x/kh/KbHX78MxWyY+d9jsmya8vGPvaJDckeXVmYX9tVZ2V5G3d/b+3TXeaYk1z/vKC97Hf9H7vXMzvHcBK4j7oACvP/lV1YJJ0939O8oNJzshX7phyYGa3UHx7kluma8//Zfr6zszutrLNmunrttssruvudZldz35AkkOT/GB3f7a7/6m7fyFJquqCzC4x+cT0/HGZxfmXMgv026df2x5/Z1U9saoOq6o/n47/jCTnd/fRmf3w6/2SXDndVnHb+zgws8h/e1X9WWYLS29K8mdJTt2z30aAMbnNIsAKV1XVu/mXeVXdv7s/V1X7Z3aJyxcWud9+u7t6XVWrFq6GA/DVBDoAAAzEJS4AADAQgQ4AAAMR6AAAMBCBDgAAAxHoAAAwkP8DsrOLzYE0ZAEAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 864x864 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "# 可视化特征重要性\n",
    "import matplotlib.pyplot as plt\n",
    "plt.rcParams['font.sans-serif']=['SimHei']\n",
    "%matplotlib inline\n",
    "importances = clf.feature_importances_\n",
    "indices = np.argsort(importances)\n",
    "plt.figure(figsize=(12,12))\n",
    "plt.title('特征重要性')\n",
    "plt.barh(range(len(indices)), importances[indices], color='seagreen', align='center')\n",
    "plt.yticks(range(len(indices)),np.array(dataset.feature_names)[indices])\n",
    "plt.xlabel('特征相对重要性');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:38.568452Z",
     "start_time": "2020-03-30T03:47:38.543519Z"
    }
   },
   "outputs": [],
   "source": [
    "selector = SelectFromModel(clf, prefit=True, threshold='median')\n",
    "# 阈值被设定为'median', 即以特征重要性的中位数作为阈值，大约为0.076\n",
    "# 我们还可以设置max_features参数以选择最重要的前几个特征\n",
    "\n",
    "transformed_train = selector.transform(train_set)\n",
    "transformed_test = selector.transform(test_set)\n",
    "assert np.array_equal(transformed_train, train_set[:,[0,5,6,7]]) \n",
    "# 选择来第1个，第6个, 第7个及第8个特征\n",
    "assert np.array_equal(transformed_test, test_set[:,[0,5,6,7]]) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": false,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": true,
   "toc_position": {
    "height": "720px",
    "left": "20.8789px",
    "top": "180px",
    "width": "360px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
